Welcome

Column

Why this project

  • Get back into the books and notes to refresh on concepts and software

  • Refresh on and practice using R/RStudio

  • Experiment with flexdashboards to see whether, and how, I might want to incorporate them into a workflow process

  • Work with a relevant data set

  • Learn new things about flexdashboards in R/Rmarkdown (e.g., using HTML for picture sizing & placement)
    Image Source: https://images.techhive.com/images/article/2016/09/data_science_classes-100682563-large.jpg

Column

Important Note(s)

  • This is best viewed on a wide-screen monitor.

  • After opening this file, expand or maximize the window to properly view it.

  • A small, or reduced, window size causes the top tabs to move to a second line in the header row. This collapses the page contents in a manner that hides various window/section headers, etc.

Experimental section

This section demonstrates showing a code block without its results.
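In the R Markdown source this is done with knitr chunk options: `echo=TRUE` displays the code while `results='hide'` (and `fig.show='hide'` for plots) suppresses the output. A sketch of such a chunk:

````
```{r, echo=TRUE, results='hide'}
summary(hr)  # code is displayed in the dashboard; printed output is not
```
````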

About The Data

Column

Data Source

IBM HR Analytics Employee Attrition & Performance
Downloaded from: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset

  • Data is fictional - created by IBM data scientists

  • Insight considerations (general):
    • Predict attrition
    • Identify the factors that contribute to attrition
    • Once contributing factors are identified, perform deep-dives and/or comparisons to develop an understanding of those factors

Data Sample (scrollable)

age attrition businesstravel dailyrate department distancefromhome education educationfield employeecount employeenumber environmentsatisfaction gender hourlyrate jobinvolvement joblevel jobrole jobsatisfaction maritalstatus monthlyincome monthlyrate numcompaniesworked over18 overtime percentsalaryhike performancerating relationshipsatisfaction standardhours stockoptionlevel totalworkingyears trainingtimeslastyear worklifebalance yearsatcompany yearsincurrentrole yearssincelastpromotion yearswithcurrmanager
41 1 travel_rarely 1102 sales 1 2 life sciences 1 1 2 female 94 3 2 sales executive 4 single 5993 19479 8 y yes 11 3 1 80 0 8 0 1 6 4 0 5
49 0 travel_frequently 279 research & development 8 1 life sciences 1 2 3 male 61 2 2 research scientist 2 married 5130 24907 1 y no 23 4 4 80 1 10 3 3 10 7 1 7
37 1 travel_rarely 1373 research & development 2 2 other 1 4 4 male 92 2 1 laboratory technician 3 single 2090 2396 6 y yes 15 3 2 80 0 7 3 3 0 0 0 0
33 0 travel_frequently 1392 research & development 3 4 life sciences 1 5 4 female 56 3 1 research scientist 3 married 2909 23159 1 y yes 11 3 3 80 0 8 3 3 8 7 3 0
27 0 travel_rarely 591 research & development 2 1 medical 1 7 1 male 40 3 1 laboratory technician 2 married 3468 16632 9 y no 12 3 4 80 1 6 3 3 2 2 2 2

Column

Data Overview

                      value
rows                   1470
columns                  35
discrete_columns          8
continuous_columns       27
all_missing_columns       0
total_missing_values      0
complete_rows          1470
total_observations    51450
memory_usage (bytes) 378144
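The overview above matches the shape of output from a helper such as DataExplorer::introduce() (my guess at the tool used). A minimal base-R sketch of the same metrics, demonstrated on a tiny stand-in data frame (the real analysis would pass the full 1470-row HR data):

```r
# Base-R version of the overview metrics above. `df` is a tiny stand-in
# data frame; the real analysis would use the full HR data set.
overview <- function(df) {
  list(
    rows                 = nrow(df),
    columns              = ncol(df),
    total_missing_values = sum(is.na(df)),
    complete_rows        = sum(complete.cases(df)),
    total_observations   = nrow(df) * ncol(df),
    memory_usage         = as.numeric(object.size(df))  # bytes
  )
}

df <- data.frame(age = c(41, 49, 37), attrition = c(1, 0, 1))
ov <- overview(df)
```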

Data Dictionary (scrollable)

\(attrition\)
0 ‘No’ 1 ‘Yes’

\(education\)
1 ‘Below College’ 2 ‘College’ 3 ‘Bachelor’ 4 ‘Master’ 5 ‘Doctor’

\(environmentsatisfaction\)
1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

\(jobinvolvement\)
1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

\(joblevel\)
Ordinal levels represented by 1, 2, 3, 4, 5. No further meaning known.

\(jobsatisfaction\)
1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

\(performancerating\)
1 ‘Low’ 2 ‘Good’ 3 ‘Excellent’ 4 ‘Outstanding’

\(relationshipsatisfaction\)
1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

\(stockoptionlevel\)
Ordinal levels represented by 0, 1, 2, 3. No further meaning known.

\(worklifebalance\)
1 ‘Bad’ 2 ‘Good’ 3 ‘Better’ 4 ‘Best’


factor_level meaning
businesstravel1 non-travel
businesstravel2 travel_rarely
businesstravel3 travel_frequently
department1 human resources
department2 research & development
department3 sales
educationfield1 human resources
educationfield2 life sciences
educationfield3 marketing
educationfield4 medical
educationfield5 other
educationfield6 technical degree
gender1 female
gender2 male
jobrole1 healthcare representative
jobrole2 human resources
jobrole3 laboratory technician
jobrole4 manager
jobrole5 manufacturing director
jobrole6 research director
jobrole7 research scientist
jobrole8 sales executive
jobrole9 sales representative
maritalstatus1 single
maritalstatus2 married
maritalstatus3 divorced
over18 y
overtime1 no
overtime2 yes
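The factor levels in the table above can be applied with base R's factor(); a sketch for two of the variables, where `hr` is a hypothetical name for the data frame holding the Kaggle data:

```r
# Apply the dictionary's factor levels. `hr` is a hypothetical name for
# the data frame holding the Kaggle data; only two variables are shown.
hr <- data.frame(
  businesstravel = c("travel_rarely", "non-travel", "travel_frequently"),
  maritalstatus  = c("single", "married", "divorced")
)

hr$businesstravel <- factor(hr$businesstravel,
  levels = c("non-travel", "travel_rarely", "travel_frequently"))

# maritalstatus was factored as an ordinal variable in this analysis
hr$maritalstatus <- factor(hr$maritalstatus,
  levels = c("single", "married", "divorced"), ordered = TRUE)
```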

Missing Values

Missing Values

Feature Distributions

Column

Continuous data distributions

Discrete data distributions

Correlations

Column

All-Data Correlations

Continuous Data Correlations

Discrete Data Correlations

Initial Observations/Notes

Column

Notes

  • Using a 70/30 train/test split for assessing model performance

  • Models to explore: logistic regression (manual and stepwise) and sparse logistic regression

  • Although \(joblevel\) and \(stockoptionlevel\) appear as numbers, they represent distinct, ordered levels. As such, we will leave the values quantitative but interpret them as ordinal variables in this analysis.

  • Refer to Data Exploration \(\rightarrow\) About The Data \(\rightarrow\) Data Dictionary section to see which variables were factored and their corresponding factor levels.

  • \(maritalstatus\) was factored as an ordinal variable
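The 70/30 split can be sketched in base R; the seed and data-frame name are my assumptions. Note that round(0.7 * 1470) = 1029, which matches the observation count seen in the contingency tables later in this document:

```r
# 70/30 train/test split on the 1470-row data set. `hr` below is a
# synthetic stand-in with the same row count as the real data.
set.seed(42)
hr <- data.frame(id = 1:1470, attrition = rbinom(1470, 1, 237 / 1470))
train_idx <- sample(nrow(hr), size = round(0.7 * nrow(hr)))
train <- hr[train_idx, ]
test  <- hr[-train_idx, ]
```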


Observations

  • The following variables/predictors are not needed for analysis:
    • \(employeecount\) –> each observation represents a single employee
    • \(over18\) –> all employees are over 18
    • \(standardhours\) –> has only 1 unique value (80)
    • \(employeenumber\) –> simply an identifier used to reference employees
  • There are no missing values; therefore, no imputation or removal of instances is required

  • Multicollinearity observed - high correlations involving the following continuous variables could affect the model:
    • \(age\) \(\rightarrow\) \(joblevel\), \(monthlyincome\), \(totalworkingyears\) and \(yearsatcompany\)
    • \(joblevel\) \(\rightarrow\) \(monthlyincome\), \(totalworkingyears\) and \(yearsatcompany\)
    • \(monthlyincome\) \(\rightarrow\) \(totalworkingyears\) and \(yearsatcompany\)
    • \(percentsalaryhike\) \(\rightarrow\) \(performancerating\)
    • \(totalworkingyears\) \(\rightarrow\) \(yearsatcompany\)
    • \(yearsatcompany\) \(\rightarrow\) \(yearsincurrentrole\), \(yearssincelastpromotion\) and \(yearswithcurrmanager\)
  • High correlations among categorical data levels will be ignored initially. I’m choosing to do this because:
    • I am not expanding the data set to include dummy variables for each category level. Doing so will increase the dimensionality.
    • The variable selection process may resolve the issue.
  • The data is unbalanced on the response variable \(attrition\): 1233 observations have \(attrition = 0\) (‘No’) versus 237 with \(attrition = 1\) (‘Yes’)
  • In his book An Introduction to Categorical Data Analysis (2nd Ed.), Agresti discusses a guideline suggesting there should “…ideally be at least 10 outcomes of each type for every predictor.” With only 237 attrition events, this guideline indicates that our final logistic regression model should have no more than 23-24 predictors.
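Agresti's guideline can be checked directly; the limiting count is the rarer outcome:

```r
# Events-per-variable check: at least 10 outcomes of each type per predictor,
# so the rarer outcome (attrition = 1) limits the model size.
n_no  <- 1233  # attrition = 0, from the frequency table above
n_yes <- 237   # attrition = 1
max_predictors <- floor(min(n_yes, n_no) / 10)
```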

Saturated (Full) Model

Saturated (Full) Model

\[\begin{align} logit[P(attrition = 1 (Yes))] = &\beta_0 + \beta_1age + \beta_2businesstravel + \\ &\beta_3dailyrate + \beta_4department + \beta_5distancefromhome + \beta_6education + \\ &\beta_7educationfield + \beta_8environmentsatisfaction + \beta_9gender + \beta_{10}hourlyrate + \\ &\beta_{11}jobinvolvement + \beta_{12}joblevel + \beta_{13}jobrole + \beta_{14}jobsatisfaction + \\ &\beta_{15}maritalstatus + \beta_{16}monthlyincome + \beta_{17}monthlyrate + \beta_{18}numcompaniesworked + \\ &\beta_{19}overtime + \beta_{20}percentsalaryhike + \beta_{21}performancerating + \\ &\beta_{22}relationshipsatisfaction + \beta_{23}stockoptionlevel + \\ &\beta_{24}totalworkingyears + \beta_{25}trainingtimeslastyear + \\ &\beta_{26}worklifebalance + \beta_{27}yearsatcompany + \beta_{28}yearsincurrentrole + \\ &\beta_{29}yearssincelastpromotion + \beta_{30}yearswithcurrmanager \end{align}\]
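The saturated model above would be fit with glm(). A sketch on a small synthetic stand-in for the training split (the real call would use all predictors from the full training data, here assumed to be named `train`):

```r
# Fit a binomial logistic regression. `train` is a synthetic stand-in;
# the real call would be glm(attrition ~ ., family = binomial, data = train)
# with the full set of predictors.
set.seed(1)
train <- data.frame(
  attrition = rbinom(200, 1, 0.16),
  age       = rnorm(200, 37, 9),
  overtime  = factor(sample(c("no", "yes"), 200, replace = TRUE))
)
m1 <- glm(attrition ~ ., family = binomial(link = "logit"), data = train)
summary(m1)$coefficients  # Estimate, Std. Error, z value, Pr(>|z|)
```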

Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.40085 604.81628 -0.02050 0.98364
age -0.02413 0.01627 -1.48349 0.13794
businesstraveltravel_rarely 1.31458 0.49256 2.66886 0.00761
businesstraveltravel_frequently 2.27719 0.53433 4.26177 2e-05
dailyrate -0.00038 0.00027 -1.42892 0.15303
departmentresearch & development 14.05655 604.81332 0.02324 0.98146
departmentsales 14.06676 604.81348 0.02326 0.98144
distancefromhome 0.03483 0.01299 2.68027 0.00736
education -0.04364 0.10747 -0.40609 0.68468
educationfieldlife sciences -0.18511 1.01477 -0.18241 0.85526
educationfieldmarketing 0.31086 1.06724 0.29128 0.77084
educationfieldmedical -0.20909 1.01499 -0.20600 0.83679
educationfieldother -0.49864 1.11872 -0.44572 0.6558
educationfieldtechnical degree 0.97273 1.03981 0.93549 0.34954
environmentsatisfaction -0.50942 0.10230 -4.97947 0
gendermale 0.39426 0.22275 1.76999 0.07673
hourlyrate 0.00539 0.00546 0.98715 0.32357
jobinvolvement -0.57517 0.14525 -3.95971 8e-05
joblevel -0.21105 0.38549 -0.54749 0.58404
jobrolehuman resources 16.06546 604.81381 0.02656 0.97881
jobrolelaboratory technician 1.58117 0.62439 2.53236 0.01133
jobrolemanager 0.42054 1.08537 0.38747 0.69841
jobrolemanufacturing director 0.41759 0.67944 0.61461 0.53881
jobroleresearch director -2.36486 1.42958 -1.65423 0.09808
jobroleresearch scientist 0.88143 0.62900 1.40132 0.16112
jobrolesales executive 1.12898 1.31113 0.86107 0.3892
jobrolesales representative 2.42566 1.38310 1.75379 0.07947
jobsatisfaction -0.33215 0.10019 -3.31520 0.00092
maritalstatusmarried -0.80895 0.30540 -2.64883 0.00808
maritalstatusdivorced -1.12109 0.42434 -2.64198 0.00824
monthlyincome 0.00008 0.00010 0.76904 0.44187
monthlyrate 0.00000 0.00002 0.29423 0.76858
numcompaniesworked 0.21586 0.04653 4.63894 0
overtimeyes 2.17380 0.24219 8.97541 0
percentsalaryhike -0.04615 0.04766 -0.96819 0.33295
performancerating 0.25050 0.49705 0.50397 0.61428
relationshipsatisfaction -0.24206 0.09985 -2.42427 0.01534
stockoptionlevel -0.18874 0.18727 -1.00785 0.31353
totalworkingyears -0.08158 0.03611 -2.25895 0.02389
trainingtimeslastyear -0.19218 0.08583 -2.23921 0.02514
worklifebalance -0.27068 0.15456 -1.75132 0.07989
yearsatcompany 0.12042 0.04726 2.54777 0.01084
yearsincurrentrole -0.20571 0.05674 -3.62563 0.00029
yearssincelastpromotion 0.16701 0.05117 3.26377 0.0011
yearswithcurrmanager -0.09943 0.05982 -1.66212 0.09649

Check X vs. Y Independence

Column

Check X vs. Y Independence

  • Here, we check whether each categorical variable is related to the response (\(H_0\): no relationship/independence) using contingency tables, \(\chi^2\) statistics, and \(p\)-values. Where a contingency table pairs an ordinal variable with attrition (nominal, 2 levels), we instead use the Cochran-Mantel-Haenszel (CMH) test, a linear trend test, since it has more power. Each CMH result appears directly beneath the corresponding CrossTable and is indicated by d.f. = 1.


  • Use \(\chi^2\) test for independence for the following predictors w/ attrition:
    • \(businesstravel\)
    • \(department\)
    • \(educationfield\)
    • \(gender\)
    • \(jobrole\)
    • \(maritalstatus\)
    • \(overtime\)


  • CMH test for the remaining categorical predictors vs. attrition


  • For the following predictors, we fail to reject independence from the response (\(p-value > 0.05\)). We will remove them from the model since the response does not appear to depend on them.
    • \(gender\)
    • \(relationshipsatisfaction\)
    • \(worklifebalance\)

Column

\(\chi^2\) tests (scrollable)


 
   Cell Contents
|-------------------------|
|                       N |
|              Expected N |
|-------------------------|

 
Total Observations in Table:  1029 

 
                  | attrition 
   businesstravel |         0 |         1 | Row Total | 
------------------|-----------|-----------|-----------|
       non-travel |        99 |         7 |       106 | 
                  |    88.591 |    17.409 |           | 
------------------|-----------|-----------|-----------|
    travel_rarely |       620 |       113 |       733 | 
                  |   612.614 |   120.386 |           | 
------------------|-----------|-----------|-----------|
travel_frequently |       141 |        49 |       190 | 
                  |   158.795 |    31.205 |           | 
------------------|-----------|-----------|-----------|
     Column Total |       860 |       169 |      1029 | 
------------------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  20.13083     d.f. =  2     p =  4.252523e-05 


 




 
   Cell Contents
|-------------------------|
|                       N |
|              Expected N |
|-------------------------|

 
Total Observations in Table:  1029 

 
                       | attrition 
            department |         0 |         1 | Row Total | 
-----------------------|-----------|-----------|-----------|
       human resources |        33 |         8 |        41 | 
                       |    34.266 |     6.734 |           | 
-----------------------|-----------|-----------|-----------|
research & development |       583 |        91 |       674 | 
                       |   563.304 |   110.696 |           | 
-----------------------|-----------|-----------|-----------|
                 sales |       244 |        70 |       314 | 
                       |   262.430 |    51.570 |           | 
-----------------------|-----------|-----------|-----------|
          Column Total |       860 |       169 |      1029 | 
-----------------------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  12.35835     d.f. =  2     p =  0.002072139 


 




 
   Cell Contents
|-------------------------|
|                       N |
|              Expected N |
|-------------------------|

 
Total Observations in Table:  1029 

 
                 | attrition 
  educationfield |         0 |         1 | Row Total | 
-----------------|-----------|-----------|-----------|
 human resources |        15 |         4 |        19 | 
                 |    15.879 |     3.121 |           | 
-----------------|-----------|-----------|-----------|
   life sciences |       360 |        63 |       423 | 
                 |   353.528 |    69.472 |           | 
-----------------|-----------|-----------|-----------|
       marketing |        94 |        27 |       121 | 
                 |   101.127 |    19.873 |           | 
-----------------|-----------|-----------|-----------|
         medical |       275 |        42 |       317 | 
                 |   264.937 |    52.063 |           | 
-----------------|-----------|-----------|-----------|
           other |        49 |         5 |        54 | 
                 |    45.131 |     8.869 |           | 
-----------------|-----------|-----------|-----------|
technical degree |        67 |        28 |        95 | 
                 |    79.397 |    15.603 |           | 
-----------------|-----------|-----------|-----------|
    Column Total |       860 |       169 |      1029 | 
-----------------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  20.20982     d.f. =  5     p =  0.001141335 


 




 
   Cell Contents
|-------------------------|
|                       N |
|              Expected N |
|-------------------------|

 
Total Observations in Table:  1029 

 
             | attrition 
      gender |         0 |         1 | Row Total | 
-------------|-----------|-----------|-----------|
      female |       359 |        63 |       422 | 
             |   352.692 |    69.308 |           | 
-------------|-----------|-----------|-----------|
        male |       501 |       106 |       607 | 
             |   507.308 |    99.692 |           | 
-------------|-----------|-----------|-----------|
Column Total |       860 |       169 |      1029 | 
-------------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  1.164534     d.f. =  1     p =  0.2805271 

Pearson's Chi-squared test with Yates' continuity correction 
------------------------------------------------------------
Chi^2 =  0.9872404     d.f. =  1     p =  0.3204178 

 




 
   Cell Contents
|-------------------------|
|                       N |
|              Expected N |
|-------------------------|

 
Total Observations in Table:  1029 

 
                          | attrition 
                  jobrole |         0 |         1 | Row Total | 
--------------------------|-----------|-----------|-----------|
healthcare representative |        83 |         5 |        88 | 
                          |    73.547 |    14.453 |           | 
--------------------------|-----------|-----------|-----------|
          human resources |        22 |         8 |        30 | 
                          |    25.073 |     4.927 |           | 
--------------------------|-----------|-----------|-----------|
    laboratory technician |       137 |        37 |       174 | 
                          |   145.423 |    28.577 |           | 
--------------------------|-----------|-----------|-----------|
                  manager |        74 |         4 |        78 | 
                          |    65.190 |    12.810 |           | 
--------------------------|-----------|-----------|-----------|
   manufacturing director |        84 |         7 |        91 | 
                          |    76.054 |    14.946 |           | 
--------------------------|-----------|-----------|-----------|
        research director |        60 |         1 |        61 | 
                          |    50.982 |    10.018 |           | 
--------------------------|-----------|-----------|-----------|
       research scientist |       182 |        39 |       221 | 
                          |   184.704 |    36.296 |           | 
--------------------------|-----------|-----------|-----------|
          sales executive |       183 |        43 |       226 | 
                          |   188.882 |    37.118 |           | 
--------------------------|-----------|-----------|-----------|
     sales representative |        35 |        25 |        60 | 
                          |    50.146 |     9.854 |           | 
--------------------------|-----------|-----------|-----------|
             Column Total |       860 |       169 |      1029 | 
--------------------------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  63.88879     d.f. =  8     p =  8.000966e-11 


 




 
   Cell Contents
|-------------------------|
|                       N |
|              Expected N |
|-------------------------|

 
Total Observations in Table:  1029 

 
              | attrition 
maritalstatus |         0 |         1 | Row Total | 
--------------|-----------|-----------|-----------|
       single |       245 |        84 |       329 | 
              |   274.966 |    54.034 |           | 
--------------|-----------|-----------|-----------|
      married |       405 |        63 |       468 | 
              |   391.137 |    76.863 |           | 
--------------|-----------|-----------|-----------|
     divorced |       210 |        22 |       232 | 
              |   193.897 |    38.103 |           | 
--------------|-----------|-----------|-----------|
 Column Total |       860 |       169 |      1029 | 
--------------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  31.01857     d.f. =  2     p =  1.838245e-07 


 




 
   Cell Contents
|-------------------------|
|                       N |
|              Expected N |
|-------------------------|

 
Total Observations in Table:  1029 

 
             | attrition 
    overtime |         0 |         1 | Row Total | 
-------------|-----------|-----------|-----------|
          no |       664 |        78 |       742 | 
             |   620.136 |   121.864 |           | 
-------------|-----------|-----------|-----------|
         yes |       196 |        91 |       287 | 
             |   239.864 |    47.136 |           | 
-------------|-----------|-----------|-----------|
Column Total |       860 |       169 |      1029 | 
-------------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  67.73148     d.f. =  1     p =  1.873485e-16 

Pearson's Chi-squared test with Yates' continuity correction 
------------------------------------------------------------
Chi^2 =  66.19615     d.f. =  1     p =  4.082086e-16 

 

CMH tests (scrollable)


 
   Cell Contents
|-------------------------|
|                       N |
|              Expected N |
|-------------------------|

 
Total Observations in Table:  1029 

 
             | attrition 
   education |         0 |         1 | Row Total | 
-------------|-----------|-----------|-----------|
           1 |        96 |        25 |       121 | 
             |       101 |        20 |           | 
-------------|-----------|-----------|-----------|
           2 |       160 |        33 |       193 | 
             |       161 |        32 |           | 
-------------|-----------|-----------|-----------|
           3 |       341 |        73 |       414 | 
             |       346 |        68 |           | 
-------------|-----------|-----------|-----------|
           4 |       234 |        35 |       269 | 
             |       225 |        44 |           | 
-------------|-----------|-----------|-----------|
           5 |        29 |         3 |        32 | 
             |        27 |         5 |           | 
-------------|-----------|-----------|-----------|
Column Total |       860 |       169 |      1029 | 
-------------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  5.528328     d.f. =  4     p =  0.2372506 


 
     Chisq         Df       Prob 
4.36088337 1.00000000 0.03677323 




 
   Cell Contents
|-------------------------|
|                       N |
|              Expected N |
|-------------------------|

 
Total Observations in Table:  1029 

 
                        | attrition 
environmentsatisfaction |         0 |         1 | Row Total | 
------------------------|-----------|-----------|-----------|
                      1 |       144 |        55 |       199 | 
                        |       166 |        33 |           | 
------------------------|-----------|-----------|-----------|
                      2 |       190 |        30 |       220 | 
                        |       184 |        36 |           | 
------------------------|-----------|-----------|-----------|
                      3 |       262 |        48 |       310 | 
                        |       259 |        51 |           | 
------------------------|-----------|-----------|-----------|
                      4 |       264 |        36 |       300 | 
                        |       251 |        49 |           | 
------------------------|-----------|-----------|-----------|
           Column Total |       860 |       169 |      1029 | 
------------------------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  23.95468     d.f. =  3     p =  2.553014e-05 


 
       Chisq           Df         Prob 
1.602041e+01 1.000000e+00 6.266329e-05 




 
   Cell Contents
|-------------------------|
|                       N |
|              Expected N |
|-------------------------|

 
Total Observations in Table:  1029 

 
               | attrition 
jobinvolvement |         0 |         1 | Row Total | 
---------------|-----------|-----------|-----------|
             1 |        43 |        23 |        66 | 
               |        55 |        11 |           | 
---------------|-----------|-----------|-----------|
             2 |       212 |        48 |       260 | 
               |       217 |        43 |           | 
---------------|-----------|-----------|-----------|
             3 |       513 |        89 |       602 | 
               |       503 |        99 |           | 
---------------|-----------|-----------|-----------|
             4 |        92 |         9 |       101 | 
               |        84 |        17 |           | 
---------------|-----------|-----------|-----------|
  Column Total |       860 |       169 |      1029 | 
---------------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  22.44157     d.f. =  3     p =  5.278864e-05 


 
       Chisq           Df         Prob 
1.856557e+01 1.000000e+00 1.641587e-05 




 
   Cell Contents
|-------------------------|
|                       N |
|              Expected N |
|-------------------------|

 
Total Observations in Table:  1029 

 
             | attrition 
    joblevel |         0 |         1 | Row Total | 
-------------|-----------|-----------|-----------|
           1 |       284 |       102 |       386 | 
             |       323 |        63 |           | 
-------------|-----------|-----------|-----------|
           2 |       327 |        40 |       367 | 
             |       307 |        60 |           | 
-------------|-----------|-----------|-----------|
           3 |       126 |        20 |       146 | 
             |       122 |        24 |           | 
-------------|-----------|-----------|-----------|
           4 |        76 |         3 |        79 | 
             |        66 |        13 |           | 
-------------|-----------|-----------|-----------|
           5 |        47 |         4 |        51 | 
             |        43 |         8 |           | 
-------------|-----------|-----------|-----------|
Column Total |       860 |       169 |      1029 | 
-------------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  48.98864     d.f. =  4     p =  5.870761e-10 


 
       Chisq           Df         Prob 
3.199786e+01 1.000000e+00 1.543427e-08 




 
   Cell Contents
|-------------------------|
|                       N |
|              Expected N |
|-------------------------|

 
Total Observations in Table:  1029 

 
                | attrition 
jobsatisfaction |         0 |         1 | Row Total | 
----------------|-----------|-----------|-----------|
              1 |       156 |        41 |       197 | 
                |       165 |        32 |           | 
----------------|-----------|-----------|-----------|
              2 |       157 |        32 |       189 | 
                |       158 |        31 |           | 
----------------|-----------|-----------|-----------|
              3 |       269 |        55 |       324 | 
                |       271 |        53 |           | 
----------------|-----------|-----------|-----------|
              4 |       278 |        41 |       319 | 
                |       267 |        52 |           | 
----------------|-----------|-----------|-----------|
   Column Total |       860 |       169 |      1029 | 
----------------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  5.834937     d.f. =  3     p =  0.1199229 


 
     Chisq         Df       Prob 
5.20628004 1.00000000 0.02250544 




 
   Cell Contents
|-------------------------|
|                       N |
|              Expected N |
|-------------------------|

 
Total Observations in Table:  1029 

 
                         | attrition 
relationshipsatisfaction |         0 |         1 | Row Total | 
-------------------------|-----------|-----------|-----------|
                       1 |       158 |        45 |       203 | 
                         |       170 |        33 |           | 
-------------------------|-----------|-----------|-----------|
                       2 |       180 |        30 |       210 | 
                         |       176 |        34 |           | 
-------------------------|-----------|-----------|-----------|
                       3 |       277 |        47 |       324 | 
                         |       271 |        53 |           | 
-------------------------|-----------|-----------|-----------|
                       4 |       245 |        47 |       292 | 
                         |       244 |        48 |           | 
-------------------------|-----------|-----------|-----------|
            Column Total |       860 |       169 |      1029 | 
-------------------------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  6.46917     d.f. =  3     p =  0.09088631 


 
    Chisq        Df      Prob 
2.3512264 1.0000000 0.1251845 




 
   Cell Contents
|-------------------------|
|                       N |
|              Expected N |
|-------------------------|

 
Total Observations in Table:  1029 

 
                 | attrition 
stockoptionlevel |         0 |         1 | Row Total | 
-----------------|-----------|-----------|-----------|
               0 |       332 |       109 |       441 | 
                 |       369 |        72 |           | 
-----------------|-----------|-----------|-----------|
               1 |       375 |        41 |       416 | 
                 |       348 |        68 |           | 
-----------------|-----------|-----------|-----------|
               2 |       104 |         9 |       113 | 
                 |        94 |        19 |           | 
-----------------|-----------|-----------|-----------|
               3 |        49 |        10 |        59 | 
                 |        49 |        10 |           | 
-----------------|-----------|-----------|-----------|
    Column Total |       860 |       169 |      1029 | 
-----------------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  41.07117     d.f. =  3     p =  6.315821e-09 


 
       Chisq           Df         Prob 
2.017610e+01 1.000000e+00 7.062985e-06 




 
   Cell Contents
|-------------------------|
|                       N |
|              Expected N |
|-------------------------|

 
Total Observations in Table:  1029 

 
                | attrition 
worklifebalance |         0 |         1 | Row Total | 
----------------|-----------|-----------|-----------|
              1 |        36 |        16 |        52 | 
                |        43 |         9 |           | 
----------------|-----------|-----------|-----------|
              2 |       199 |        41 |       240 | 
                |       201 |        39 |           | 
----------------|-----------|-----------|-----------|
              3 |       541 |        93 |       634 | 
                |       530 |       104 |           | 
----------------|-----------|-----------|-----------|
              4 |        84 |        19 |       103 | 
                |        86 |        17 |           | 
----------------|-----------|-----------|-----------|
   Column Total |       860 |       169 |      1029 | 
----------------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  9.601838     d.f. =  3     p =  0.02227229 


 
     Chisq         Df       Prob 
3.05963351 1.00000000 0.08025977 
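The crosstabulation output above is the format produced by `gmodels::CrossTable()`. A minimal sketch of one of these independence checks, assuming the training data frame is named `train`:

```r
library(gmodels)

# Observed counts, expected counts, and Pearson's chi-squared test
# for worklifebalance vs. attrition; proportions are suppressed to
# match the "N / Expected N" cell contents shown above.
CrossTable(train$worklifebalance, train$attrition,
           expected = TRUE, prop.r = FALSE, prop.c = FALSE,
           prop.t = FALSE, prop.chisq = FALSE)

# Equivalent test on the contingency table alone (base R)
chisq.test(table(train$worklifebalance, train$attrition))
```

The same statistic can be reproduced directly from the printed table of counts.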

Check VIFs

VIFs are checked on the model obtained after removing the three predictors from the previous section:

\[\begin{align} logit[P(attrition = 1 (Yes))] = &\beta_0 + \beta_1age + \beta_2businesstravel + \\ &\beta_3dailyrate + \beta_4department + \beta_5distancefromhome + \beta_6education + \\ &\beta_7educationfield + \beta_8environmentsatisfaction + \beta_{10}hourlyrate + \\ &\beta_{11}jobinvolvement + \beta_{12}joblevel + \beta_{13}jobrole + \beta_{14}jobsatisfaction + \\ &\beta_{15}maritalstatus + \beta_{16}monthlyincome + \beta_{17}monthlyrate + \beta_{18}numcompaniesworked + \\ &\beta_{19}overtime + \beta_{20}percentsalaryhike + \beta_{21}performancerating + \\ &\beta_{23}stockoptionlevel + \beta_{24}totalworkingyears + \beta_{25}trainingtimeslastyear + \\ &\beta_{27}yearsatcompany + \beta_{28}yearsincurrentrole + \\ &\beta_{29}yearssincelastpromotion + \beta_{30}yearswithcurrmanager \end{align}\]

GVIF Df GVIF^(1/(2*Df))
age 1.840417e+00 1 1.356620
businesstravel 1.184785e+00 2 1.043302
dailyrate 1.066514e+00 1 1.032722
department 4.323419e+07 2 81.088047
distancefromhome 1.096092e+00 1 1.046944
education 1.116183e+00 1 1.056496
educationfield 3.469739e+00 5 1.132478
environmentsatisfaction 1.126510e+00 1 1.061371
hourlyrate 1.056816e+00 1 1.028016
jobinvolvement 1.092947e+00 1 1.045441
joblevel 1.068985e+01 1 3.269534
jobrole 2.961597e+08 8 3.384312
jobsatisfaction 1.118621e+00 1 1.057649
maritalstatus 2.274122e+00 2 1.228014
monthlyincome 1.108998e+01 1 3.330161
monthlyrate 1.079802e+00 1 1.039135
numcompaniesworked 1.380134e+00 1 1.174791
overtime 1.264672e+00 1 1.124576
percentsalaryhike 2.756273e+00 1 1.660203
performancerating 2.712538e+00 1 1.646978
stockoptionlevel 2.093548e+00 1 1.446910
totalworkingyears 4.932473e+00 1 2.220917
trainingtimeslastyear 1.054267e+00 1 1.026775
yearsatcompany 6.151852e+00 1 2.480293
yearsincurrentrole 2.695909e+00 1 1.641922
yearssincelastpromotion 2.319255e+00 1 1.522910
yearswithcurrmanager 3.091729e+00 1 1.758331
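The GVIF table above is the output of `car::vif()` applied to the fitted logistic model. A sketch, assuming that model object is named `m1`:

```r
library(car)

# Generalized VIFs for the fitted glm.  For predictors with Df > 1,
# compare GVIF^(1/(2*Df)); values well above ~2 (e.g., department,
# jobrole, joblevel, monthlyincome, yearsatcompany here) signal
# collinearity and mark candidates for removal.
vif(m1)
```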

Reduced Model (m2)

Column

Reduced model m2

  • Now that we’ve identified an initial set of variables to remove, we arrive at a reduced model (m2) in the form of:

\[\begin{align} logit[P(attrition = 1 (Yes))] = &\beta_0 + \beta_1age + \beta_2businesstravel + \\ &\beta_3dailyrate + \beta_5distancefromhome + \beta_6education + \\ &\beta_7educationfield + \beta_8environmentsatisfaction + \beta_{10}hourlyrate + \\ &\beta_{11}jobinvolvement + \beta_{14}jobsatisfaction + \\ &\beta_{15}maritalstatus + \beta_{17}monthlyrate + \beta_{18}numcompaniesworked + \\ &\beta_{19}overtime + \beta_{20}percentsalaryhike + \beta_{21}performancerating + \\ &\beta_{23}stockoptionlevel + \beta_{24}totalworkingyears + \beta_{25}trainingtimeslastyear + \\ &\beta_{28}yearsincurrentrole + \\ &\beta_{29}yearssincelastpromotion + \beta_{30}yearswithcurrmanager \end{align}\]

  • Does the model (m2) fit?
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.89762 1.59473 1.81700 0.06922
age -0.02393 0.01548 -1.54582 0.12215
businesstraveltravel_rarely 1.32695 0.47871 2.77193 0.00557
businesstraveltravel_frequently 2.21691 0.51385 4.31433 2e-05
dailyrate -0.00027 0.00026 -1.06635 0.28626
distancefromhome 0.03094 0.01229 2.51831 0.01179
education -0.10312 0.10155 -1.01549 0.30987
educationfieldlife sciences -0.87567 0.73925 -1.18454 0.2362
educationfieldmarketing -0.06232 0.76849 -0.08109 0.93537
educationfieldmedical -0.99789 0.74980 -1.33088 0.18323
educationfieldother -1.26278 0.89331 -1.41360 0.15748
educationfieldtechnical degree 0.26081 0.77526 0.33641 0.73656
environmentsatisfaction -0.50315 0.09647 -5.21562 0
hourlyrate 0.00348 0.00511 0.68159 0.4955
jobinvolvement -0.64914 0.14001 -4.63634 0
jobsatisfaction -0.27984 0.09371 -2.98619 0.00282
maritalstatusmarried -0.70602 0.28583 -2.47005 0.01351
maritalstatusdivorced -1.05249 0.40122 -2.62320 0.00871
monthlyrate 0.00000 0.00001 0.08527 0.93205
numcompaniesworked 0.17224 0.04265 4.03846 5e-05
overtimeyes 1.83452 0.21577 8.50199 0
percentsalaryhike -0.02972 0.04459 -0.66642 0.50515
performancerating 0.21818 0.47009 0.46412 0.64256
stockoptionlevel -0.13817 0.17670 -0.78195 0.43425
totalworkingyears -0.08632 0.02514 -3.43405 0.00059
trainingtimeslastyear -0.17493 0.08120 -2.15439 0.03121
yearsincurrentrole -0.13663 0.05123 -2.66700 0.00765
yearssincelastpromotion 0.17277 0.04674 3.69683 0.00022
yearswithcurrmanager -0.02938 0.05232 -0.56151 0.57445
Residual.Deviance Residual.df
642.7284 1000
  • We see that model m2 fits, since \(\frac{deviance_{res}}{df_{res}} \leq 1\). Next, check the marginal model plots to assess model validity and to see whether any of the continuous predictors are misspecified. If a predictor is misspecified, check the conditional density plots to see what kind of transformation may be needed.

Column

Marginal Model Plots (scrollable)

  • Check the marginal model plots (mmp) of the quantitative predictors to see if the model is specified correctly
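These plots can be produced with `car::mmps()` (marginal model plots); a sketch, assuming the fitted model object is `m2`:

```r
library(car)

# One panel per quantitative predictor plus the linear predictor.
# The "Data" smooth and the "Model" smooth should roughly coincide
# when a predictor is specified correctly.
mmps(m2)
```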

Conditional Density Plots (scrollable)

  • \(age\) appears to be misspecified.

  • \(age\) appears to be approximately normally distributed, with similar variance for both values of \(attrition\) (i.e., yes and no). Let’s try adding a quadratic term for \(age\) to the model.
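Assuming the reduced model object is named `m2`, the quadratic term can be added with a one-line `update()`:

```r
# Add the quadratic age term suggested by the conditional density
# plots; I() protects the arithmetic inside the model formula.
m3 <- update(m2, . ~ . + I(age^2))
summary(m3)
```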

Reduced Model (m3)

Column

Model m3

  • Here we will include a quadratic term in the model for \(age\) and arrive at model (m3) in the form of:

\[\begin{align} logit[P(attrition = 1 (Yes))] = &\beta_0 + \beta_1age + \beta_{1a}age^2 + \beta_2businesstravel + \\ &\beta_3dailyrate + \beta_5distancefromhome + \beta_6education + \\ &\beta_7educationfield + \beta_8environmentsatisfaction + \beta_{10}hourlyrate + \\ &\beta_{11}jobinvolvement + \beta_{14}jobsatisfaction + \\ &\beta_{15}maritalstatus + \beta_{17}monthlyrate + \beta_{18}numcompaniesworked + \\ &\beta_{19}overtime + \beta_{20}percentsalaryhike + \beta_{21}performancerating + \\ &\beta_{23}stockoptionlevel + \beta_{24}totalworkingyears + \beta_{25}trainingtimeslastyear + \\ &\beta_{28}yearsincurrentrole + \\ &\beta_{29}yearssincelastpromotion + \beta_{30}yearswithcurrmanager \end{align}\]

  • Does the model (m3) fit?
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.29298 2.07533 3.51412 0.00044
age -0.28164 0.07867 -3.58011 0.00034
I(age^2) 0.00338 0.00101 3.36209 0.00077
businesstraveltravel_rarely 1.29460 0.48524 2.66794 0.00763
businesstraveltravel_frequently 2.21435 0.52003 4.25815 2e-05
dailyrate -0.00031 0.00026 -1.18928 0.23433
distancefromhome 0.03005 0.01236 2.43047 0.01508
education -0.04823 0.10426 -0.46265 0.64361
educationfieldlife sciences -0.92361 0.74617 -1.23780 0.21579
educationfieldmarketing -0.08924 0.77513 -0.11513 0.90834
educationfieldmedical -1.03216 0.75642 -1.36453 0.1724
educationfieldother -1.28829 0.90164 -1.42884 0.15305
educationfieldtechnical degree 0.15087 0.78271 0.19275 0.84715
environmentsatisfaction -0.50753 0.09726 -5.21854 0
hourlyrate 0.00382 0.00515 0.74058 0.45895
jobinvolvement -0.65153 0.14125 -4.61264 0
jobsatisfaction -0.27098 0.09469 -2.86171 0.00421
maritalstatusmarried -0.65875 0.28968 -2.27407 0.02296
maritalstatusdivorced -0.97599 0.40609 -2.40337 0.01624
monthlyrate 0.00000 0.00001 -0.09178 0.92687
numcompaniesworked 0.17913 0.04298 4.16796 3e-05
overtimeyes 1.84479 0.21882 8.43059 0
percentsalaryhike -0.03638 0.04525 -0.80396 0.42142
performancerating 0.26208 0.47515 0.55158 0.58123
stockoptionlevel -0.13170 0.17904 -0.73558 0.46199
totalworkingyears -0.08600 0.02482 -3.46514 0.00053
trainingtimeslastyear -0.17986 0.08211 -2.19050 0.02849
yearsincurrentrole -0.12795 0.05153 -2.48310 0.01302
yearssincelastpromotion 0.16528 0.04672 3.53777 4e-04
yearswithcurrmanager -0.01267 0.05313 -0.23840 0.81157
Residual.Deviance Residual.df
631.7444 999
  • We see that model m3 fits, since \(\frac{deviance_{res}}{df_{res}} \leq 1\). Check the marginal model plots.

Column

Marginal Model Plots (scrollable)

  • Check the marginal model plots (mmp) of the quantitative predictors to see if the model is specified correctly

Observations on model m3

  • Adding the \(age^2\) term corrects some misspecification in the model.

  • Look at the standardized deviance residuals and inspect for outliers and bad leverage points
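The outlier and leverage screening can be sketched as follows; the cut-offs here are conventional rules of thumb, not necessarily the exact thresholds used in this analysis:

```r
# Standardized deviance residuals and leverage for model m3
std_res <- rstandard(m3)   # standardized deviance residuals
lev     <- hatvalues(m3)   # leverage (hat values)

# Rules of thumb: |standardized residual| > 2 flags an outlier;
# leverage above twice the average hat value flags a
# high-leverage point.
outliers <- which(abs(std_res) > 2)
high_lev <- which(lev > 2 * mean(lev))

length(outliers)   # this analysis identified 34
```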

Leverage (Model m3)

Column

Leverage Plot

Observations

  • There don’t appear to be any bad leverage points, although there are several high-leverage points.

  • There do appear to be outliers in the data.

  • Since this is a simulated dataset, we will assume that there is sufficient reason for removing the outliers.

Column

Number of outliers

34

Outlier indices

6, 11, 17, 38, 47, 62, 68, 82, 154, 184, 204, 264, 309, 321, 352, 364, 390, 558, 561, 606, 609, 660, 668, 680, 685, 693, 720, 730, 755, 786, 817, 837, 876 and 910

Outlier data (scrollable)

age attrition businesstravel dailyrate department distancefromhome education educationfield environmentsatisfaction gender hourlyrate jobinvolvement joblevel jobrole jobsatisfaction maritalstatus monthlyincome monthlyrate numcompaniesworked overtime percentsalaryhike performancerating relationshipsatisfaction stockoptionlevel totalworkingyears trainingtimeslastyear worklifebalance yearsatcompany yearsincurrentrole yearssincelastpromotion yearswithcurrmanager
28 1 travel_rarely 890 research & development 2 4 medical 3 male 46 3 1 research scientist 3 single 4382 16374 6 no 17 3 4 0 5 3 2 2 2 2 1
44 1 travel_rarely 935 research & development 3 3 life sciences 1 male 89 3 1 laboratory technician 1 married 2362 14669 4 no 12 3 3 0 10 4 4 3 2 1 2
32 1 non-travel 1474 sales 11 4 other 4 male 60 4 2 sales executive 3 married 4707 23914 8 no 12 3 4 0 6 2 3 4 2 1 2
39 1 travel_rarely 1162 sales 3 2 medical 4 female 41 3 2 sales executive 3 married 5238 17778 4 yes 18 3 1 0 12 3 2 1 0 0 0
39 1 travel_rarely 360 research & development 23 3 medical 3 male 93 3 1 research scientist 1 single 3904 22154 0 no 13 3 1 0 6 2 3 5 2 0 3
36 1 travel_rarely 530 sales 3 1 life sciences 3 male 51 2 3 sales executive 4 married 10325 5518 1 yes 11 3 1 1 16 6 3 16 7 3 7
27 1 travel_rarely 1420 sales 2 1 marketing 3 male 85 3 1 sales representative 1 divorced 3041 16346 0 no 11 3 2 1 5 3 3 4 3 0 2
21 1 travel_rarely 1427 research & development 18 1 other 4 female 65 3 1 research scientist 4 single 2693 8870 1 no 19 3 1 0 1 3 2 1 0 0 0
53 1 travel_rarely 607 research & development 2 5 technical degree 3 female 78 2 3 manufacturing director 4 married 10169 14618 0 no 16 3 2 1 34 4 3 33 7 1 9
44 1 travel_rarely 1376 human resources 1 2 medical 2 male 91 2 3 human resources 1 married 10482 2326 9 no 14 3 4 1 24 1 3 20 6 3 6
35 1 travel_frequently 880 sales 12 4 other 4 male 36 3 2 sales executive 4 single 4581 10414 3 yes 24 4 1 0 13 2 4 11 9 6 7
46 1 travel_rarely 669 sales 9 2 medical 3 male 64 2 3 sales executive 4 single 9619 13596 1 no 16 3 4 0 9 3 3 9 8 4 7
39 1 travel_frequently 203 research & development 2 3 life sciences 1 male 84 3 4 healthcare representative 4 divorced 12169 13547 7 no 11 3 4 3 21 4 3 18 7 11 5
38 1 travel_rarely 903 research & development 2 3 medical 3 male 81 3 2 manufacturing director 2 married 4855 7653 4 no 11 3 1 2 7 2 3 5 2 1 4
44 1 travel_rarely 621 research & development 15 3 medical 1 female 73 3 3 healthcare representative 4 married 7978 14075 1 no 11 3 4 1 10 2 3 10 7 0 5
34 1 travel_frequently 234 research & development 9 4 life sciences 4 male 93 3 2 laboratory technician 1 married 5346 6208 4 no 17 3 3 1 11 3 2 7 1 0 7
30 1 travel_rarely 740 sales 1 3 life sciences 2 male 64 2 2 sales executive 1 married 9714 5323 1 no 11 3 4 1 10 4 3 10 8 6 7
52 1 travel_rarely 723 research & development 8 4 medical 3 male 85 2 2 research scientist 2 married 4941 17747 2 no 15 3 1 0 11 3 2 8 2 7 7
48 1 travel_frequently 708 sales 7 2 medical 4 female 95 3 1 sales representative 3 married 2655 11740 2 yes 11 3 3 2 19 3 3 9 7 7 7
58 1 travel_rarely 147 research & development 23 4 medical 4 female 94 3 3 healthcare representative 4 married 10312 3465 1 no 12 3 4 1 40 3 2 40 10 15 6
31 1 travel_frequently 1445 research & development 1 5 life sciences 3 female 100 4 3 manufacturing director 2 single 7446 8931 1 no 11 3 1 0 10 2 3 10 8 4 7
52 1 travel_rarely 266 sales 2 1 marketing 1 female 57 1 5 manager 4 married 19845 25846 1 no 15 3 4 1 33 3 3 32 14 6 9
41 1 non-travel 906 research & development 5 2 life sciences 1 male 95 2 1 research scientist 1 divorced 2107 20293 6 no 17 3 1 1 5 2 1 1 0 0 0
41 1 travel_rarely 1360 research & development 12 3 technical degree 2 female 49 3 5 research director 3 married 19545 16280 1 no 12 3 4 0 23 0 3 22 15 15 8
46 1 travel_rarely 377 sales 9 3 marketing 1 male 52 3 3 sales executive 4 divorced 10096 15986 4 no 11 3 1 1 28 1 4 7 7 4 3
46 1 travel_rarely 1254 sales 10 3 life sciences 3 female 64 3 3 sales executive 2 married 7314 14011 5 no 21 4 3 3 14 2 3 8 7 0 7
49 1 travel_rarely 1184 sales 11 3 marketing 3 female 43 3 3 sales executive 4 married 7654 5860 1 no 18 3 1 2 9 3 4 9 8 7 7
39 1 travel_rarely 895 sales 5 3 technical degree 4 male 56 3 2 sales representative 4 married 2086 3335 3 no 14 3 3 1 19 6 4 1 0 0 0
40 1 non-travel 1479 sales 24 3 life sciences 2 female 100 4 4 sales executive 2 single 13194 17071 4 yes 16 3 4 0 22 2 2 1 0 0 0
33 1 travel_rarely 465 research & development 2 2 life sciences 1 female 39 3 1 laboratory technician 1 married 2707 21509 7 no 20 4 1 0 13 3 4 9 7 1 7
44 1 travel_frequently 920 research & development 24 3 life sciences 4 male 43 3 1 laboratory technician 3 divorced 3161 19920 3 yes 22 4 4 1 19 0 1 1 0 0 0
55 1 travel_rarely 725 research & development 2 3 medical 4 male 78 3 5 manager 1 married 19859 21199 5 yes 13 3 4 1 24 2 3 5 2 1 4
30 1 travel_rarely 945 sales 9 3 medical 2 male 89 3 1 sales representative 4 single 1081 16019 1 no 13 3 3 0 1 3 2 1 0 0 0
24 1 travel_rarely 984 research & development 17 2 life sciences 4 female 97 3 1 laboratory technician 2 married 2210 3372 1 no 13 3 1 1 1 3 1 1 0 0 0

Outliers Removed (Model m3)

Column

Model m3 w/o outliers

  • Does the model fit?
Estimate Std. Error z value Pr(>|z|)
(Intercept) 11.08086 2.88543 3.84027 0.00012
age -0.37035 0.10612 -3.48975 0.00048
I(age^2) 0.00440 0.00137 3.20619 0.00135
businesstraveltravel_rarely 2.47822 0.75830 3.26814 0.00108
businesstraveltravel_frequently 4.10756 0.82111 5.00244 0
dailyrate -0.00068 0.00035 -1.93914 0.05248
distancefromhome 0.06318 0.01701 3.71407 2e-04
education 0.00557 0.13620 0.04088 0.96739
educationfieldlife sciences -1.58314 0.95850 -1.65169 0.0986
educationfieldmarketing -0.35779 0.99758 -0.35866 0.71985
educationfieldmedical -1.87098 0.97454 -1.91986 0.05488
educationfieldother -3.10154 1.26781 -2.44638 0.01443
educationfieldtechnical degree 0.04804 1.00214 0.04794 0.96177
environmentsatisfaction -0.91278 0.14094 -6.47650 0
hourlyrate -0.00180 0.00685 -0.26285 0.79266
jobinvolvement -1.22807 0.20255 -6.06299 0
jobsatisfaction -0.43131 0.12783 -3.37418 0.00074
maritalstatusmarried -1.35908 0.40263 -3.37548 0.00074
maritalstatusdivorced -1.38510 0.54032 -2.56348 0.01036
monthlyrate 0.00000 0.00002 0.20644 0.83645
numcompaniesworked 0.30306 0.05912 5.12593 0
overtimeyes 3.13624 0.33275 9.42510 0
percentsalaryhike -0.04956 0.06186 -0.80120 0.42302
performancerating 0.49344 0.64629 0.76349 0.44517
stockoptionlevel -0.21974 0.23932 -0.91819 0.35852
totalworkingyears -0.22275 0.04092 -5.44291 0
trainingtimeslastyear -0.28519 0.11009 -2.59051 0.00958
yearsincurrentrole -0.24845 0.07696 -3.22814 0.00125
yearssincelastpromotion 0.23844 0.06847 3.48225 5e-04
yearswithcurrmanager 0.08868 0.07987 1.11028 0.26688
Residual.Deviance Residual.df
366.584 965
  • We see that model m3 still fits without the outliers in the data, since \(\frac{deviance_{res}}{df_{res}} \leq 1\). Check the marginal model plots.

Column

Marginal Model Plots (scrollable)

  • Check the marginal model plots (mmp) of the quantitative predictors to see if the model is specified correctly

Observations on model m3

  • After correcting the predictor specification earlier, we saw that the model’s fit could potentially still improve. Therefore, we inspected the leverage and looked for outliers.

  • We identified 34 outliers in the training set, and, for the purpose of this analysis, removed those outliers under the assumption that there was sufficient reason to do so. Recall, this is a simulated data set.

  • After removing the outliers from the training set and re-running model m3 with the new data, we find that the overall linear fit of the model improved greatly.

  • There still appear to be several predictors that are not significant, though. Let’s run stepwise variable selection to see if we can further reduce the number of predictors in the model.

Variable Selection (Model m3)

Column

Fwd/Bwd Stepwise Selection

Stepwise Model Path 
Analysis of Deviance Table

Initial Model:
attrition ~ age + I(age^2) + businesstravel + dailyrate + distancefromhome + 
    education + educationfield + environmentsatisfaction + hourlyrate + 
    jobinvolvement + jobsatisfaction + maritalstatus + monthlyrate + 
    numcompaniesworked + overtime + percentsalaryhike + performancerating + 
    stockoptionlevel + totalworkingyears + trainingtimeslastyear + 
    yearsincurrentrole + yearssincelastpromotion + yearswithcurrmanager

Final Model:
attrition ~ age + I(age^2) + businesstravel + dailyrate + distancefromhome + 
    educationfield + environmentsatisfaction + jobinvolvement + 
    jobsatisfaction + maritalstatus + numcompaniesworked + overtime + 
    totalworkingyears + trainingtimeslastyear + yearsincurrentrole + 
    yearssincelastpromotion


                    Step Df    Deviance Resid. Df Resid. Dev      AIC
1                                             965   366.5840 426.5840
2            - education  1 0.001671317       966   366.5857 424.5857
3          - monthlyrate  1 0.041329610       967   366.6270 422.6270
4           - hourlyrate  1 0.062737836       968   366.6898 420.6898
5    - performancerating  1 0.541018729       969   367.2308 419.2308
6    - percentsalaryhike  1 0.110578520       970   367.3414 417.3414
7     - stockoptionlevel  1 0.976428369       971   368.3178 416.3178
8 - yearswithcurrmanager  1 1.012913703       972   369.3307 415.3307
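The path above is the output of `step()` with AIC as the criterion; a sketch, assuming the model object is `m3`:

```r
# Forward/backward stepwise selection by AIC, starting from m3
m_step <- step(m3, direction = "both", trace = 1)

# The Analysis of Deviance table for the selection path
m_step$anova
```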

Column

Observations

  • Stepwise variable selection on model m3 removes seven variables.

  • All seven of the removed variables were previously non-significant.

  • Interestingly, \(dailyrate\) remains in the model even though it was not statistically significant before.

Final Logistic Regr Model

Final Logistic Regression Model

\[\begin{align} logit[P(attrition = 1 (Yes))] = &\beta_0 + \beta_1age + \beta_{1a}age^2 + \beta_2businesstravel + \\ &\beta_3dailyrate + \beta_5distancefromhome + \\ &\beta_7educationfield + \beta_8environmentsatisfaction + \\ &\beta_{11}jobinvolvement + \beta_{14}jobsatisfaction + \\ &\beta_{15}maritalstatus + \beta_{18}numcompaniesworked + \\ &\beta_{19}overtime + \beta_{24}totalworkingyears + \beta_{25}trainingtimeslastyear + \\ &\beta_{28}yearsincurrentrole + \\ &\beta_{29}yearssincelastpromotion \end{align}\]

Estimate Std. Error z value Pr(>|z|)
(Intercept) 11.86865 2.34774 5.05535 0
age -0.36314 0.10282 -3.53188 0.00041
I(age^2) 0.00426 0.00133 3.20066 0.00137
businesstraveltravel_rarely 2.44736 0.75063 3.26041 0.00111
businesstraveltravel_frequently 4.06929 0.81015 5.02287 0
dailyrate -0.00064 0.00035 -1.86346 0.0624
distancefromhome 0.06069 0.01662 3.65113 0.00026
educationfieldlife sciences -1.76682 0.91870 -1.92317 0.05446
educationfieldmarketing -0.53888 0.95366 -0.56507 0.57203
educationfieldmedical -2.07068 0.92676 -2.23431 0.02546
educationfieldother -3.11794 1.23414 -2.52640 0.01152
educationfieldtechnical degree -0.13154 0.95904 -0.13716 0.89091
environmentsatisfaction -0.90788 0.13996 -6.48653 0
jobinvolvement -1.19419 0.19810 -6.02811 0
jobsatisfaction -0.42537 0.12569 -3.38423 0.00071
maritalstatusmarried -1.60791 0.32068 -5.01405 0
maritalstatusdivorced -1.68362 0.40700 -4.13666 4e-05
numcompaniesworked 0.29720 0.05879 5.05517 0
overtimeyes 3.13669 0.32874 9.54165 0
totalworkingyears -0.20686 0.03877 -5.33509 0
trainingtimeslastyear -0.29303 0.10945 -2.67735 0.00742
yearsincurrentrole -0.20462 0.06522 -3.13754 0.0017
yearssincelastpromotion 0.26748 0.06570 4.07117 5e-05
Residual.Deviance Residual.df
369.3307 972

SLR Model

Column

Notes

  • Before applying the Lasso method, we consider the following:
    • \(gender\), \(relationshipsatisfaction\) and \(worklifebalance\) are removed from the model because of their independence from the response variable (\(attrition\)). See Logistic Regression \(\rightarrow\) Check X vs. Y Independence.
    • The 34 outliers identified in section Logistic Regression \(\rightarrow\) Leverage were removed from the training data set to reduce the effects of outliers on the model.
    • \(department\), \(joblevel\), \(jobrole\), \(monthlyincome\), and \(yearsatcompany\) were removed because of high VIF/GVIF values. See Logistic Regression \(\rightarrow\) Check VIFs.
    • The quadratic term \(age^2\) is added to the model because we saw earlier that \(age\) is misspecified in the model. See Logistic Regression \(\rightarrow\) Reduced Model(m2) and Reduced Model(m3) sections.
    • For the remaining predictor terms, we will apply the Lasso method for variable selection.
    • The starting model for SLR (lasso) is the same as model m3. See Logistic Regression \(\rightarrow\) Reduced Model(m3).


Model before applying SLR (lasso)

\[\begin{align} logit[P(attrition = 1 (Yes))] = &\beta_0 + \beta_1age + \beta_{1a}age^2 + \beta_2businesstravel + \\ &\beta_3dailyrate + \beta_5distancefromhome + \beta_6education + \\ &\beta_7educationfield + \beta_8environmentsatisfaction + \beta_{10}hourlyrate + \\ &\beta_{11}jobinvolvement + \beta_{14}jobsatisfaction + \\ &\beta_{15}maritalstatus + \beta_{17}monthlyrate + \beta_{18}numcompaniesworked + \\ &\beta_{19}overtime + \beta_{20}percentsalaryhike + \beta_{21}performancerating + \\ &\beta_{23}stockoptionlevel + \beta_{24}totalworkingyears + \beta_{25}trainingtimeslastyear + \\ &\beta_{28}yearsincurrentrole + \\ &\beta_{29}yearssincelastpromotion + \beta_{30}yearswithcurrmanager \end{align}\]

Sparse Logistic Regression (SLR) Model Fit

Column

SLR Coefficients

coef_for_lambda_min coef_for_lambda_1se
(Intercept) 3.65983374026907 -0.421810470939189
age -0.24524238227586 -0.0191318358586904
age_sq 0.00283512351833354 0
businesstravel 1.56532271480022 0.817320668532357
dailyrate -0.000601615403144011 -0.0002201357094025
distancefromhome 0.0491562155026073 0.0190447607580739
education 0.00592522968043428 0
educationfield 0.129915310062553 0.0127507615619703
environmentsatisfaction -0.752816272518327 -0.408482786862819
hourlyrate -0.00238420670969245 0
jobinvolvement -1.06144472857087 -0.620078680375977
jobsatisfaction -0.416612406296805 -0.198859784912876
maritalstatus -0.602899629258942 -0.405845309268576
monthlyrate 3.79991366887273e-06 0
numcompaniesworked 0.270628341008726 0.108657243387995
overtime 2.73062139115611 1.83780606141179
percentsalaryhike -0.041483014849234 0
performancerating 0.358197380191492 0
stockoptionlevel -0.354735806848068 -0.1823727134757
totalworkingyears -0.208366172960455 -0.10611210161498
trainingtimeslastyear -0.247066913046405 -0.0700495155574719
yearsincurrentrole -0.191047848229074 -0.0619798594440491
yearssincelastpromotion 0.206531430074502 0.0053199843505578
yearswithcurrmanager 0.0439659692332794 0
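The two coefficient columns correspond to `lambda.min` and `lambda.1se` from a cross-validated lasso fit. A sketch with the `glmnet` package; the `x`/`y` construction and object names are assumptions (the single coefficient per factor in the table suggests the factors were coded numerically):

```r
library(glmnet)

# Design matrix for the m3 predictor set with the quadratic age
# term; `train` is the assumed training data frame.
x <- model.matrix(attrition ~ . + I(age^2), data = train)[, -1]
y <- train$attrition

set.seed(42)  # CV fold assignment is random
# Cross-validated lasso-penalized logistic regression (alpha = 1)
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients at the two standard lambda choices shown above
coef(cvfit, s = "lambda.min")
coef(cvfit, s = "lambda.1se")
```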

Logistic Regression Comparisons

Column

Final Logistic Regr Coefficients by Step-wise Var Select

Estimate Std. Error z value Pr(>|z|)
(Intercept) 11.86865 2.34774 5.05535 0
age -0.36314 0.10282 -3.53188 0.00041
I(age^2) 0.00426 0.00133 3.20066 0.00137
businesstraveltravel_rarely 2.44736 0.75063 3.26041 0.00111
businesstraveltravel_frequently 4.06929 0.81015 5.02287 0
dailyrate -0.00064 0.00035 -1.86346 0.0624
distancefromhome 0.06069 0.01662 3.65113 0.00026
educationfieldlife sciences -1.76682 0.91870 -1.92317 0.05446
educationfieldmarketing -0.53888 0.95366 -0.56507 0.57203
educationfieldmedical -2.07068 0.92676 -2.23431 0.02546
educationfieldother -3.11794 1.23414 -2.52640 0.01152
educationfieldtechnical degree -0.13154 0.95904 -0.13716 0.89091
environmentsatisfaction -0.90788 0.13996 -6.48653 0
jobinvolvement -1.19419 0.19810 -6.02811 0
jobsatisfaction -0.42537 0.12569 -3.38423 0.00071
maritalstatusmarried -1.60791 0.32068 -5.01405 0
maritalstatusdivorced -1.68362 0.40700 -4.13666 4e-05
numcompaniesworked 0.29720 0.05879 5.05517 0
overtimeyes 3.13669 0.32874 9.54165 0
totalworkingyears -0.20686 0.03877 -5.33509 0
trainingtimeslastyear -0.29303 0.10945 -2.67735 0.00742
yearsincurrentrole -0.20462 0.06522 -3.13754 0.0017
yearssincelastpromotion 0.26748 0.06570 4.07117 5e-05

Column

Final Logistic Regr Coef by Lasso Var Select

coef_for_lambda_min coef_for_lambda_1se
(Intercept) 3.65983374026907 -0.421810470939189
age -0.24524238227586 -0.0191318358586904
age_sq 0.00283512351833354 0
businesstravel 1.56532271480022 0.817320668532357
dailyrate -0.000601615403144011 -0.0002201357094025
distancefromhome 0.0491562155026073 0.0190447607580739
education 0.00592522968043428 0
educationfield 0.129915310062553 0.0127507615619703
environmentsatisfaction -0.752816272518327 -0.408482786862819
hourlyrate -0.00238420670969245 0
jobinvolvement -1.06144472857087 -0.620078680375977
jobsatisfaction -0.416612406296805 -0.198859784912876
maritalstatus -0.602899629258942 -0.405845309268576
monthlyrate 3.79991366887273e-06 0
numcompaniesworked 0.270628341008726 0.108657243387995
overtime 2.73062139115611 1.83780606141179
percentsalaryhike -0.041483014849234 0
performancerating 0.358197380191492 0
stockoptionlevel -0.354735806848068 -0.1823727134757
totalworkingyears -0.208366172960455 -0.10611210161498
trainingtimeslastyear -0.247066913046405 -0.0700495155574719
yearsincurrentrole -0.191047848229074 -0.0619798594440491
yearssincelastpromotion 0.206531430074502 0.0053199843505578
yearswithcurrmanager 0.0439659692332794 0

RF Model

Column

Random Forest Model

  • We’ll use the whole training set for the Random Forest model, since random forests should be less affected by collinearity.

  • We’ll use the training set to identify important variables to keep and then re-run a more sparse model before measuring performance.

  • We’ll also build two initial models based on: 1) training set w/ outliers and 2) training set w/o outliers

  • For the plots to the right:
    • Green - class error for class 1 (i.e., \(attrition\) = 1 (yes))
    • Red - class error for class 0 (i.e., \(attrition\) = 0 (no))
    • Black - out-of-bag (OOB) error
    • NOTE: we see lower error for class 0 because there are more “No” responses to learn from in the data
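A sketch of the random forest fit and the error plot described above (object and data names are assumptions; a second fit would use the outlier-free training set):

```r
library(randomForest)

set.seed(42)
# Saturated RF on the full training set; the response is coerced
# to a factor so randomForest performs classification.
rf_full <- randomForest(as.factor(attrition) ~ ., data = train,
                        ntree = 500, importance = TRUE)

# Black = OOB error, coloured lines = per-class error,
# as described in the bullets above.
plot(rf_full)
legend("topright", colnames(rf_full$err.rate), lty = 1:3, col = 1:3)
```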

Column

RF Model Plot (w/ outliers)

RF Model Plot (w/o outliers)

Variable Importance (w/ Outliers)

Column

Variable Importance (w/ outliers) - Mean Decr in Accuracy

Variable Importance Plot (w/ outliers) - Mean Decr in Accuracy

Column

Variable Importance (w/ outliers) - Mean Decr in Node Impurity

Variable Importance Plot (w/ outliers) - Mean Decr in Node Impurity

Variable Importance (w/o Outliers)

Column

Variable Importance (w/o outliers) - Mean Decr in Accuracy

Variable Importance Plot (w/o outliers) - Mean Decr in Accuracy

Column

Variable Importance (w/o outliers) - Mean Decr in Node Impurity

Variable Importance Plot (w/o outliers) - Mean Decr in Node Impurity

Observations

Observations

  • To compare bar plots, I chose to focus on predictor variables with a Mean Decrease in Accuracy \(\geq 5\) for identifying important variables and to find a potentially more sparse model.

  • Based on the selected cut-off above, we will focus on the following as the most important, since they appear in both random forest models (trained with and without outliers):
    • \(age\)
    • \(environmentsatisfaction\)
    • \(joblevel\)
    • \(jobrole\)
    • \(maritalstatus\)
    • \(monthlyincome\)
    • \(overtime\)
    • \(stockoptionlevel\)
    • \(totalworkingyears\)
    • \(yearsatcompany\)
  • Recall the variables noted as having correlations (collinearity) from the Initial Observations/Notes section. Although randomForest can handle data with collinearities, we see here that several correlated variables were given high importance. In particular, consider the correlated relationships among \(age\), \(joblevel\), \(monthlyincome\), \(totalworkingyears\), and \(yearsatcompany\).

  • Let’s rerun an RF model on data that does not have the following variables:
    • \(totalworkingyears\)
      • because a company may or may not know this info
      • we’re assuming that the total number of years a person has been working is irrelevant regarding attrition (i.e., you can quit at any time, you can get another offer at any time, you can be fired at any time…all of which don’t necessarily account for how long you’ve been in the workforce.)
    • \(yearsatcompany\)
      • this is more of an ‘umbrella’ measure that can overlap with other variables it’s correlated with (ex. \(yearswithcurrentmanager\))
      • correlated with \(monthlyincome\)
    • \(monthlyincome\)
      • it’s correlated with \(age\) and \(joblevel\)
      • it’s reasonable to expect that \(monthlyincome\) will be greater with a higher \(age\) and/or \(joblevel\)
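Dropping the three collinear variables and refitting can be sketched as follows (names assumed as in the bullets above):

```r
library(randomForest)

# Remove the three variables flagged above, then refit
drop_vars <- c("totalworkingyears", "yearsatcompany", "monthlyincome")
train2 <- train[, setdiff(names(train), drop_vars)]

set.seed(42)
rf2 <- randomForest(as.factor(attrition) ~ ., data = train2,
                    ntree = 500, importance = TRUE)

# Rank by Mean Decrease in Accuracy and keep predictors >= 5,
# the cut-off chosen earlier
imp <- importance(rf2, type = 1)
imp[imp[, 1] >= 5, , drop = FALSE]
varImpPlot(rf2, type = 1)
```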

Variable Importance 2 (w/ Outliers)

Column

Variable Importance (w/ outliers) - Mean Decr in Accuracy

Variable Importance Plot (w/ outliers) - Mean Decr in Accuracy

Column

Variable Importance (w/ outliers) - Mean Decr in Node Impurity

Variable Importance Plot (w/ outliers) - Mean Decr in Node Impurity

Variable Importance 2 (w/o Outliers)

Column

Variable Importance (w/o outliers) - Mean Decr in Accuracy

Variable Importance Plot (w/o outliers) - Mean Decr in Accuracy

Column

Variable Importance (w/o outliers) - Mean Decr in Node Impurity

Variable Importance Plot (w/o outliers) - Mean Decr in Node Impurity

Observations 2

Observations

  • Again, to compare bar plots, I chose to focus on predictor variables with a Mean Decrease in Accuracy \(\geq 5\) for identifying important variables and to find a potentially more sparse model.

  • Based on the selected cut-off above, we will focus on the following as the most important; each appears in both the RF model trained on data containing outliers and the RF model trained on data without them:
    • \(age\)
    • \(educationfield\)
    • \(environmentsatisfaction\)
    • \(jobinvolvement\)
    • \(joblevel\)
    • \(jobrole\)
    • \(maritalstatus\)
    • \(numcompaniesworked\)
    • \(overtime\)
    • \(stockoptionlevel\)
    • \(yearsincurrentrole\)
  • Now, let’s develop a sparse random forest model that uses only the 11 most important variables we just identified.

RF Reduced Model

Column

RF Reduced Model Plot (w/ outliers)

RF Reduced Model Plot (w/o outliers)

Model Performance

Column

Performance - test set

AUC Correct Classification Rate Misclassification Rate
Logistic Regr 0.811543920517271 0.87075 0.12925
SLR (lambda.min) 0.816905850812178 0.85488 0.14512
SLR (lambda.1se) 0.797035167954585 0.86168 0.13832
RF (Saturated Model)* 0.822504336855386 0.86168 0.13832
RF (Saturated Model) 0.820690742785049 0.86621 0.13379
RF (Reduced Model)* 0.82069074278505 0.86395 0.13605
RF (Reduced Model) 0.823115439205173 0.86848 0.13152
* Training set contained outliers during model building.
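The AUC and classification-rate columns can be computed as below. This is a sketch: the `m_final` model object, `test` data frame, and the 0.5 cut-off are assumptions.

```r
library(pROC)

# Predicted probabilities from the logistic model on the test set
probs <- predict(m_final, newdata = test, type = "response")
pred_class <- as.integer(probs > 0.5)

# Area under the ROC curve
auc(roc(test$attrition, probs))

# Correct classification rate and misclassification rate
ccr <- mean(pred_class == test$attrition)
c(CCR = ccr, Misclass = 1 - ccr)
```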

Column

Receiver Operating Characteristic (ROC) Curves

Predictions

Predicted Probabilities & Class

test_data_index attrition Logistic Regr Prob Logistic Regr Class SLR (lambda_min) Prob SLR (lambda_min) Class SLR (lambda_1se) Prob SLR (lambda_1se) Class RF_full_w/_outliers Prob RF_full_w/_outliers Class RF_full_w/o_outliers Prob RF_full_w/o_outliers Class RF_reduced_w/_outliers Prob RF_reduced_w/_outliers Class RF_reduced_w/o_outliers Prob RF_reduced_w/o_outliers Class
1 1 0.1337551 0 0.4187221 0 0.4508243 0 0.357 0 0.339 0 0.357 0 0.355 0
2 0 0.0045126 0 0.0083202 0 0.0472341 0 0.066 0 0.074 0 0.093 0 0.077 0
3 0 0.0270882 0 0.0334978 0 0.0608848 0 0.272 0 0.224 0 0.296 0 0.208 0
4 0 0.0345040 0 0.0621377 0 0.1246250 0 0.062 0 0.042 0 0.057 0 0.041 0
5 0 0.0064818 0 0.0142927 0 0.0536165 0 0.410 0 0.347 0 0.423 0 0.338 0
6 0 0.0001915 0 0.0022330 0 0.0182444 0 0.091 0 0.065 0 0.092 0 0.053 0
7 0 0.1184722 0 0.1005978 0 0.1512744 0 0.294 0 0.210 0 0.301 0 0.211 0
8 0 0.0072444 0 0.0185718 0 0.0716059 0 0.113 0 0.082 0 0.117 0 0.069 0
9 0 0.0008074 0 0.0191949 0 0.0888132 0 0.080 0 0.036 0 0.067 0 0.044 0
10 1 0.3216325 0 0.4173543 0 0.3328851 0 0.353 0 0.319 0 0.389 0 0.349 0
11 0 0.8813008 1 0.8918318 1 0.7215041 1 0.319 0 0.334 0 0.327 0 0.350 0
12 0 0.0585817 0 0.1364132 0 0.1505066 0 0.187 0 0.127 0 0.190 0 0.131 0
13 0 0.1095135 0 0.2457727 0 0.2106443 0 0.114 0 0.138 0 0.139 0 0.126 0
14 0 0.2009834 0 0.2346135 0 0.3191659 0 0.035 0 0.021 0 0.035 0 0.026 0
15 0 0.1199910 0 0.3371876 0 0.2492579 0 0.091 0 0.067 0 0.112 0 0.056 0
16 0 0.0017555 0 0.0032567 0 0.0309785 0 0.031 0 0.024 0 0.040 0 0.024 0
17 0 0.0037252 0 0.0582621 0 0.1277394 0 0.110 0 0.077 0 0.113 0 0.080 0
18 0 0.0781689 0 0.1094613 0 0.1650881 0 0.166 0 0.110 0 0.189 0 0.122 0
19 0 0.0006637 0 0.0010312 0 0.0106097 0 0.208 0 0.192 0 0.236 0 0.191 0
20 0 0.2381318 0 0.0984095 0 0.0927205 0 0.353 0 0.290 0 0.365 0 0.278 0
21 0 0.0053633 0 0.0038909 0 0.0183735 0 0.095 0 0.045 0 0.097 0 0.046 0
22 0 0.0108929 0 0.0196967 0 0.0497142 0 0.146 0 0.102 0 0.180 0 0.091 0
23 0 0.0070470 0 0.0719895 0 0.1058973 0 0.172 0 0.157 0 0.163 0 0.137 0
24 0 0.0000030 0 0.0000095 0 0.0007603 0 0.179 0 0.070 0 0.174 0 0.075 0
25 0 0.0351221 0 0.0557567 0 0.1261415 0 0.102 0 0.105 0 0.138 0 0.096 0
26 0 0.0264458 0 0.0446655 0 0.1381170 0 0.185 0 0.173 0 0.180 0 0.163 0
27 0 0.0112697 0 0.0307188 0 0.0613374 0 0.110 0 0.109 0 0.114 0 0.085 0
28 0 0.0028556 0 0.0045635 0 0.0385197 0 0.144 0 0.092 0 0.107 0 0.103 0
29 0 0.0005014 0 0.0012573 0 0.0178767 0 0.079 0 0.018 0 0.063 0 0.016 0
30 0 0.0007068 0 0.0046479 0 0.0173095 0 0.108 0 0.030 0 0.093 0 0.031 0
31 0 0.0586538 0 0.0230449 0 0.0534774 0 0.143 0 0.071 0 0.148 0 0.072 0
32 0 0.0075223 0 0.0110632 0 0.0773356 0 0.125 0 0.051 0 0.123 0 0.092 0
33 1 0.1908698 0 0.1574158 0 0.2026731 0 0.164 0 0.132 0 0.142 0 0.148 0
34 0 0.4100984 0 0.4752593 0 0.2932294 0 0.226 0 0.197 0 0.231 0 0.219 0
35 0 0.0351379 0 0.0123723 0 0.0818762 0 0.230 0 0.202 0 0.228 0 0.214 0
36 1 0.1919437 0 0.3031714 0 0.2444309 0 0.366 0 0.232 0 0.413 0 0.243 0
37 0 0.0105224 0 0.0223195 0 0.0662741 0 0.178 0 0.144 0 0.157 0 0.137 0
38 0 0.0859349 0 0.0365784 0 0.0975079 0 0.221 0 0.194 0 0.228 0 0.198 0
39 0 0.0046028 0 0.0071235 0 0.0466816 0 0.159 0 0.094 0 0.137 0 0.098 0
40 0 0.0086627 0 0.0244027 0 0.0640974 0 0.099 0 0.040 0 0.095 0 0.027 0
41 0 0.0093444 0 0.0388963 0 0.0796283 0 0.050 0 0.030 0 0.042 0 0.019 0
42 0 0.0029099 0 0.0038009 0 0.0256727 0 0.118 0 0.031 0 0.105 0 0.044 0
43 0 0.0360528 0 0.1121357 0 0.2111306 0 0.309 0 0.327 0 0.303 0 0.329 0
44 0 0.0008743 0 0.0011603 0 0.0126282 0 0.059 0 0.018 0 0.067 0 0.009 0
45 0 0.0114729 0 0.0287557 0 0.0527610 0 0.171 0 0.126 0 0.177 0 0.126 0
46 0 0.0001516 0 0.0008493 0 0.0068252 0 0.115 0 0.048 0 0.121 0 0.054 0
47 0 0.0328435 0 0.0494563 0 0.1402950 0 0.195 0 0.104 0 0.223 0 0.131 0
48 0 0.0404809 0 0.0253462 0 0.0600043 0 0.182 0 0.189 0 0.196 0 0.184 0
49 0 0.0134792 0 0.0264300 0 0.0732834 0 0.144 0 0.086 0 0.125 0 0.077 0
50 0 0.0078715 0 0.0297649 0 0.0729885 0 0.075 0 0.044 0 0.087 0 0.046 0
51 0 0.0007378 0 0.0021073 0 0.0104622 0 0.113 0 0.025 0 0.076 0 0.022 0
52 0 0.0012921 0 0.0070192 0 0.0530667 0 0.022 0 0.013 0 0.031 0 0.008 0
53 0 0.0024030 0 0.0117787 0 0.0162441 0 0.117 0 0.057 0 0.131 0 0.051 0
54 0 0.0026386 0 0.0025510 0 0.0210910 0 0.125 0 0.048 0 0.130 0 0.057 0
55 1 0.3284143 0 0.4939918 0 0.4094639 0 0.228 0 0.219 0 0.216 0 0.201 0
56 1 0.0288190 0 0.0126501 0 0.0386043 0 0.153 0 0.103 0 0.158 0 0.101 0
57 1 0.0024981 0 0.0166987 0 0.0295823 0 0.174 0 0.138 0 0.170 0 0.148 0
58 0 0.0225045 0 0.0327220 0 0.0979431 0 0.064 0 0.067 0 0.065 0 0.071 0
59 1 0.8471049 1 0.6223316 1 0.4343002 0 0.310 0 0.307 0 0.317 0 0.323 0
60 0 0.0004397 0 0.0020089 0 0.0062553 0 0.134 0 0.057 0 0.125 0 0.082 0
61 0 0.0009878 0 0.0017079 0 0.0213123 0 0.070 0 0.041 0 0.078 0 0.032 0
62 0 0.0389992 0 0.1883261 0 0.1707194 0 0.197 0 0.083 0 0.187 0 0.108 0
63 0 0.0200781 0 0.0655895 0 0.1637667 0 0.119 0 0.071 0 0.083 0 0.052 0
64 0 0.0897300 0 0.2285908 0 0.0906802 0 0.269 0 0.204 0 0.267 0 0.178 0
65 0 0.0000024 0 0.0000114 0 0.0009173 0 0.077 0 0.013 0 0.063 0 0.019 0
66 0 0.0265708 0 0.0218304 0 0.0274346 0 0.275 0 0.261 0 0.288 0 0.247 0
67 0 0.0013741 0 0.0050014 0 0.0393164 0 0.160 0 0.075 0 0.141 0 0.051 0
68 1 0.3336486 0 0.3327704 0 0.4128249 0 0.313 0 0.334 0 0.354 0 0.348 0
69 0 0.0046679 0 0.0058214 0 0.0276992 0 0.117 0 0.060 0 0.087 0 0.050 0
70 0 0.0027885 0 0.0051027 0 0.0313399 0 0.168 0 0.108 0 0.172 0 0.094 0
71 0 0.0212682 0 0.0376949 0 0.0917933 0 0.039 0 0.008 0 0.043 0 0.012 0
72 1 0.0163098 0 0.0245831 0 0.0654238 0 0.152 0 0.103 0 0.164 0 0.095 0
73 0 0.0000321 0 0.0000938 0 0.0049418 0 0.044 0 0.009 0 0.031 0 0.010 0
74 0 0.0075065 0 0.0276098 0 0.0598509 0 0.259 0 0.150 0 0.255 0 0.151 0
75 0 0.0001282 0 0.0006631 0 0.0078769 0 0.090 0 0.051 0 0.093 0 0.037 0
76 0 0.1166397 0 0.0694775 0 0.1247120 0 0.201 0 0.148 0 0.201 0 0.157 0
77 0 0.0401757 0 0.0549363 0 0.1314545 0 0.246 0 0.207 0 0.210 0 0.201 0
78 0 0.0000267 0 0.0001233 0 0.0029547 0 0.059 0 0.054 0 0.066 0 0.059 0
79 1 0.1957583 0 0.3580698 0 0.1771279 0 0.114 0 0.104 0 0.126 0 0.100 0
80 0 0.0006776 0 0.0035939 0 0.0347680 0 0.064 0 0.028 0 0.059 0 0.028 0
81 0 0.0159722 0 0.0243066 0 0.1001524 0 0.094 0 0.073 0 0.090 0 0.077 0
82 0 0.9448432 1 0.9664750 1 0.7495044 1 0.211 0 0.237 0 0.217 0 0.243 0
83 0 0.0000451 0 0.0001158 0 0.0053106 0 0.026 0 0.027 0 0.044 0 0.030 0
84 1 0.5987490 1 0.6246445 1 0.4462352 0 0.368 0 0.388 0 0.366 0 0.363 0
85 0 0.0034845 0 0.0056455 0 0.0445092 0 0.079 0 0.054 0 0.061 0 0.046 0
86 0 0.4357993 0 0.2672650 0 0.0966984 0 0.218 0 0.185 0 0.214 0 0.181 0
87 0 0.0498006 0 0.0314534 0 0.0725392 0 0.333 0 0.223 0 0.318 0 0.248 0
88 0 0.0027430 0 0.0016405 0 0.0145365 0 0.075 0 0.011 0 0.075 0 0.004 0
89 0 0.0050480 0 0.0096339 0 0.0363607 0 0.080 0 0.062 0 0.088 0 0.057 0
90 0 0.0195947 0 0.0144163 0 0.0241114 0 0.195 0 0.087 0 0.178 0 0.076 0
91 0 0.0000012 0 0.0000029 0 0.0008156 0 0.039 0 0.004 0 0.032 0 0.005 0
92 0 0.0058197 0 0.0079097 0 0.0613316 0 0.069 0 0.061 0 0.086 0 0.041 0
93 0 0.0817179 0 0.0605762 0 0.1678354 0 0.188 0 0.198 0 0.206 0 0.209 0
94 0 0.0054617 0 0.0050526 0 0.0168789 0 0.121 0 0.045 0 0.085 0 0.050 0
95 0 0.0953937 0 0.0869177 0 0.1771020 0 0.274 0 0.279 0 0.271 0 0.315 0
96 0 0.0192653 0 0.0632697 0 0.0712673 0 0.171 0 0.176 0 0.189 0 0.128 0
97 0 0.2406020 0 0.0936175 0 0.1472987 0 0.178 0 0.159 0 0.158 0 0.169 0
98 0 0.4717975 0 0.4318131 0 0.4251381 0 0.380 0 0.423 0 0.427 0 0.438 0
99 0 0.1507289 0 0.0652438 0 0.1997076 0 0.141 0 0.136 0 0.152 0 0.127 0
100 0 0.0000592 0 0.0002153 0 0.0055337 0 0.065 0 0.008 0 0.063 0 0.010 0
101 0 0.0026949 0 0.0042893 0 0.0236133 0 0.073 0 0.050 0 0.074 0 0.052 0
102 0 0.0008899 0 0.0083315 0 0.0148236 0 0.096 0 0.025 0 0.070 0 0.038 0
103 0 0.0019166 0 0.0005646 0 0.0121800 0 0.085 0 0.053 0 0.087 0 0.047 0
104 0 0.3869147 0 0.4860831 0 0.4589995 0 0.106 0 0.094 0 0.099 0 0.096 0
105 0 0.0218355 0 0.0200540 0 0.0718735 0 0.127 0 0.125 0 0.140 0 0.113 0
106 0 0.0000892 0 0.0007260 0 0.0145025 0 0.020 0 0.008 0 0.023 0 0.012 0
107 0 0.0878294 0 0.0471688 0 0.1305889 0 0.353 0 0.343 0 0.364 0 0.339 0
108 0 0.0120081 0 0.0306890 0 0.0620194 0 0.212 0 0.155 0 0.205 0 0.153 0
109 0 0.0023978 0 0.0143372 0 0.0851694 0 0.143 0 0.142 0 0.162 0 0.166 0
110 0 0.0087125 0 0.0472126 0 0.0974707 0 0.175 0 0.041 0 0.171 0 0.051 0
111 0 0.0002587 0 0.0016454 0 0.0096041 0 0.117 0 0.031 0 0.115 0 0.039 0
112 0 0.0374439 0 0.0881227 0 0.1046851 0 0.289 0 0.266 0 0.301 0 0.259 0
113 1 0.6602630 1 0.4250006 0 0.4205235 0 0.539 1 0.533 1 0.558 1 0.536 1
114 0 0.0008149 0 0.0041854 0 0.0291088 0 0.098 0 0.045 0 0.072 0 0.035 0
115 0 0.0010930 0 0.0040304 0 0.0241975 0 0.129 0 0.063 0 0.150 0 0.074 0
116 0 0.0295114 0 0.0338515 0 0.0955538 0 0.136 0 0.091 0 0.132 0 0.087 0
117 0 0.0001283 0 0.0004283 0 0.0080276 0 0.119 0 0.009 0 0.113 0 0.020 0
118 0 0.0035600 0 0.0021569 0 0.0186939 0 0.040 0 0.030 0 0.037 0 0.018 0
119 0 0.0006537 0 0.0047012 0 0.0276844 0 0.081 0 0.060 0 0.096 0 0.063 0
120 0 0.0031794 0 0.0042375 0 0.0596358 0 0.057 0 0.046 0 0.084 0 0.043 0
121 0 0.0010254 0 0.0008358 0 0.0031248 0 0.324 0 0.157 0 0.298 0 0.124 0
122 0 0.0012098 0 0.0016650 0 0.0211428 0 0.038 0 0.005 0 0.039 0 0.004 0
123 0 0.0074586 0 0.0204373 0 0.0403816 0 0.080 0 0.052 0 0.082 0 0.034 0
124 0 0.0031283 0 0.0053851 0 0.0063269 0 0.229 0 0.202 0 0.246 0 0.176 0
125 0 0.0000204 0 0.0000878 0 0.0043743 0 0.043 0 0.010 0 0.052 0 0.014 0
126 0 0.0166280 0 0.0087690 0 0.0479625 0 0.212 0 0.051 0 0.157 0 0.056 0
127 0 0.0011968 0 0.0022213 0 0.0144465 0 0.018 0 0.008 0 0.008 0 0.009 0
128 0 0.0552652 0 0.0235869 0 0.1225138 0 0.077 0 0.057 0 0.100 0 0.051 0
129 0 0.0000739 0 0.0001878 0 0.0090469 0 0.055 0 0.044 0 0.060 0 0.032 0
130 0 0.0458771 0 0.0251006 0 0.0763054 0 0.375 0 0.363 0 0.375 0 0.336 0
131 1 0.1582966 0 0.2847517 0 0.2313425 0 0.285 0 0.259 0 0.298 0 0.259 0
132 1 0.9867730 1 0.7702262 1 0.5484662 1 0.294 0 0.250 0 0.291 0 0.266 0
133 0 0.0040325 0 0.0181613 0 0.0283898 0 0.176 0 0.046 0 0.142 0 0.048 0
134 0 0.0010971 0 0.0006741 0 0.0054194 0 0.133 0 0.119 0 0.135 0 0.113 0
135 0 0.0001922 0 0.0010149 0 0.0119226 0 0.087 0 0.057 0 0.078 0 0.054 0
136 0 0.0047337 0 0.0206817 0 0.0684658 0 0.102 0 0.082 0 0.093 0 0.063 0
137 0 0.1861249 0 0.1726628 0 0.3146860 0 0.277 0 0.300 0 0.282 0 0.269 0
138 0 0.0165069 0 0.0676318 0 0.1193857 0 0.115 0 0.079 0 0.101 0 0.076 0
139 0 0.0000558 0 0.0011592 0 0.0133743 0 0.048 0 0.008 0 0.051 0 0.009 0
140 0 0.0013361 0 0.0025209 0 0.0160384 0 0.121 0 0.062 0 0.100 0 0.045 0
141 0 0.0198842 0 0.0084451 0 0.0488225 0 0.072 0 0.041 0 0.067 0 0.046 0
142 0 0.0010625 0 0.0060920 0 0.0200540 0 0.141 0 0.103 0 0.143 0 0.088 0
143 0 0.0001299 0 0.0005179 0 0.0133263 0 0.044 0 0.012 0 0.052 0 0.013 0
144 0 0.0000461 0 0.0000681 0 0.0072878 0 0.049 0 0.034 0 0.048 0 0.028 0
145 0 0.0283510 0 0.2112498 0 0.1671991 0 0.226 0 0.148 0 0.240 0 0.104 0
146 1 0.0133459 0 0.0344811 0 0.0709359 0 0.180 0 0.124 0 0.164 0 0.117 0
147 0 0.0003722 0 0.0008136 0 0.0143468 0 0.026 0 0.007 0 0.030 0 0.004 0
148 0 0.0316823 0 0.0770000 0 0.1231009 0 0.080 0 0.046 0 0.087 0 0.045 0
149 0 0.0006107 0 0.0020953 0 0.0186313 0 0.096 0 0.049 0 0.088 0 0.056 0
150 0 0.0791667 0 0.2314215 0 0.1601991 0 0.106 0 0.081 0 0.125 0 0.106 0
151 0 0.1108319 0 0.1664897 0 0.2618657 0 0.280 0 0.336 0 0.290 0 0.339 0
152 0 0.0004814 0 0.0078145 0 0.0217114 0 0.025 0 0.004 0 0.023 0 0.006 0
153 0 0.0033356 0 0.0134315 0 0.0575561 0 0.083 0 0.042 0 0.087 0 0.046 0
154 0 0.0046335 0 0.0092666 0 0.0165996 0 0.183 0 0.042 0 0.180 0 0.045 0
155 0 0.0002570 0 0.0007382 0 0.0165888 0 0.015 0 0.005 0 0.009 0 0.012 0
156 0 0.0016421 0 0.0012287 0 0.0137771 0 0.101 0 0.033 0 0.101 0 0.027 0
157 0 0.9411722 1 0.6982641 1 0.2205437 0 0.371 0 0.350 0 0.371 0 0.340 0
158 0 0.0893883 0 0.0705186 0 0.0699067 0 0.164 0 0.128 0 0.172 0 0.128 0
159 0 0.0017157 0 0.0099601 0 0.0426781 0 0.106 0 0.042 0 0.099 0 0.041 0
160 0 0.1246630 0 0.1432676 0 0.0472070 0 0.272 0 0.217 0 0.260 0 0.243 0
161 0 0.1430649 0 0.0456246 0 0.0391877 0 0.138 0 0.133 0 0.146 0 0.122 0
162 0 0.0000496 0 0.0002375 0 0.0037002 0 0.150 0 0.054 0 0.150 0 0.050 0
163 0 0.1261464 0 0.2363198 0 0.3048984 0 0.318 0 0.312 0 0.294 0 0.289 0
164 0 0.0833318 0 0.1556333 0 0.1718231 0 0.127 0 0.115 0 0.121 0 0.120 0
165 0 0.0000230 0 0.0001138 0 0.0046981 0 0.114 0 0.041 0 0.102 0 0.040 0
166 0 0.0672990 0 0.0736348 0 0.1030247 0 0.087 0 0.056 0 0.096 0 0.054 0
167 0 0.1336384 0 0.3358414 0 0.2935904 0 0.457 0 0.407 0 0.460 0 0.441 0
168 1 0.0106911 0 0.0667543 0 0.1105915 0 0.227 0 0.200 0 0.252 0 0.231 0
169 0 0.7705408 1 0.6477037 1 0.4028946 0 0.364 0 0.364 0 0.363 0 0.334 0
170 0 0.6982219 1 0.7134085 1 0.4480307 0 0.214 0 0.214 0 0.243 0 0.197 0
171 0 0.0011596 0 0.0077892 0 0.0329859 0 0.034 0 0.005 0 0.027 0 0.006 0
172 0 0.0031621 0 0.0032252 0 0.0078969 0 0.048 0 0.018 0 0.032 0 0.015 0
173 1 0.2938411 0 0.2874528 0 0.2882168 0 0.387 0 0.382 0 0.392 0 0.385 0
174 1 0.8631972 1 0.8935911 1 0.7075276 1 0.386 0 0.356 0 0.406 0 0.382 0
175 0 0.0001488 0 0.0015320 0 0.0151451 0 0.103 0 0.041 0 0.101 0 0.038 0
176 0 0.0000693 0 0.0016627 0 0.0215273 0 0.046 0 0.009 0 0.050 0 0.012 0
177 1 0.0039267 0 0.0024247 0 0.0066741 0 0.190 0 0.170 0 0.204 0 0.175 0
178 0 0.0030091 0 0.0057057 0 0.0293645 0 0.081 0 0.069 0 0.095 0 0.068 0
179 0 0.0342259 0 0.0171627 0 0.0593716 0 0.139 0 0.133 0 0.124 0 0.122 0
180 0 0.5133942 1 0.6122211 1 0.4549639 0 0.209 0 0.219 0 0.240 0 0.215 0
181 0 0.1869409 0 0.1948410 0 0.2555306 0 0.124 0 0.096 0 0.118 0 0.092 0
182 1 0.0087300 0 0.0139821 0 0.0231813 0 0.169 0 0.056 0 0.159 0 0.054 0
183 0 0.1390102 0 0.0560900 0 0.1507287 0 0.148 0 0.138 0 0.144 0 0.168 0
184 0 0.5888683 1 0.4489420 0 0.2886850 0 0.371 0 0.348 0 0.369 0 0.366 0
185 0 0.0059118 0 0.0087592 0 0.0409931 0 0.083 0 0.055 0 0.097 0 0.058 0
186 0 0.0318950 0 0.0093579 0 0.0559848 0 0.288 0 0.241 0 0.296 0 0.235 0
187 0 0.0009859 0 0.0052140 0 0.0266601 0 0.039 0 0.017 0 0.044 0 0.019 0
188 0 0.0001229 0 0.0005341 0 0.0105824 0 0.101 0 0.050 0 0.098 0 0.044 0
189 0 0.0501286 0 0.0625177 0 0.0781054 0 0.239 0 0.218 0 0.217 0 0.219 0
190 0 0.0719036 0 0.4355450 0 0.3978982 0 0.284 0 0.250 0 0.282 0 0.264 0
191 1 0.5787015 1 0.4825893 0 0.2627013 0 0.370 0 0.311 0 0.336 0 0.322 0
192 0 0.1324848 0 0.0780338 0 0.1611493 0 0.208 0 0.174 0 0.203 0 0.149 0
193 0 0.0008665 0 0.0020288 0 0.0309646 0 0.080 0 0.080 0 0.090 0 0.097 0
194 0 0.0092122 0 0.0096625 0 0.0320395 0 0.114 0 0.043 0 0.130 0 0.045 0
195 0 0.0023512 0 0.0080001 0 0.0329625 0 0.032 0 0.024 0 0.028 0 0.027 0
196 0 0.1173947 0 0.3540086 0 0.3397612 0 0.410 0 0.398 0 0.403 0 0.404 0
197 0 0.0001335 0 0.0002342 0 0.0016432 0 0.142 0 0.024 0 0.135 0 0.025 0
198 1 0.9534352 1 0.9087436 1 0.7479968 1 0.642 1 0.625 1 0.639 1 0.625 1
199 0 0.0564550 0 0.0927846 0 0.1625383 0 0.170 0 0.142 0 0.156 0 0.128 0
200 0 0.1128428 0 0.1328838 0 0.1383095 0 0.117 0 0.069 0 0.100 0 0.081 0
201 0 0.0019940 0 0.0040490 0 0.0424676 0 0.093 0 0.051 0 0.103 0 0.061 0
202 0 0.1363895 0 0.1336864 0 0.2929541 0 0.437 0 0.410 0 0.423 0 0.407 0
203 1 0.2063555 0 0.1655506 0 0.3039219 0 0.209 0 0.205 0 0.216 0 0.201 0
204 1 0.1259600 0 0.4018362 0 0.3130601 0 0.364 0 0.338 0 0.374 0 0.346 0
205 0 0.0444273 0 0.0358925 0 0.0859927 0 0.061 0 0.047 0 0.082 0 0.047 0
206 0 0.0000296 0 0.0004971 0 0.0077484 0 0.126 0 0.031 0 0.132 0 0.029 0
207 0 0.0017017 0 0.0124547 0 0.0733215 0 0.085 0 0.053 0 0.088 0 0.047 0
208 1 0.9538270 1 0.7509404 1 0.5645636 1 0.520 1 0.515 1 0.535 1 0.516 1
209 0 0.1367509 0 0.0539115 0 0.1027467 0 0.278 0 0.223 0 0.276 0 0.202 0
210 1 0.9104614 1 0.9643886 1 0.7332428 1 0.719 1 0.698 1 0.720 1 0.702 1
211 0 0.0194976 0 0.0429687 0 0.0679086 0 0.184 0 0.127 0 0.161 0 0.126 0
212 0 0.0000150 0 0.0002935 0 0.0096730 0 0.046 0 0.016 0 0.042 0 0.020 0
213 0 0.0000154 0 0.0000426 0 0.0014203 0 0.012 0 0.001 0 0.017 0 0.000 0
214 0 0.0005014 0 0.0004911 0 0.0096962 0 0.104 0 0.051 0 0.100 0 0.046 0
215 0 0.0001854 0 0.0004696 0 0.0092347 0 0.047 0 0.013 0 0.037 0 0.012 0
216 0 0.0018896 0 0.0121530 0 0.0724095 0 0.077 0 0.052 0 0.080 0 0.061 0
217 0 0.0166153 0 0.0640539 0 0.1212705 0 0.152 0 0.128 0 0.142 0 0.108 0
218 0 0.0001553 0 0.0007154 0 0.0096425 0 0.044 0 0.006 0 0.047 0 0.011 0
219 0 0.0078368 0 0.0112054 0 0.0354891 0 0.070 0 0.066 0 0.070 0 0.050 0
220 1 0.1107747 0 0.3949672 0 0.3689064 0 0.311 0 0.272 0 0.310 0 0.291 0
221 0 0.0216813 0 0.0256983 0 0.1130430 0 0.114 0 0.101 0 0.114 0 0.108 0
222 1 0.2583389 0 0.3855865 0 0.3062375 0 0.255 0 0.226 0 0.267 0 0.256 0
223 0 0.0003670 0 0.0007890 0 0.0063034 0 0.045 0 0.003 0 0.041 0 0.008 0
224 0 0.0006355 0 0.0008377 0 0.0125333 0 0.078 0 0.077 0 0.087 0 0.061 0
225 0 0.0165701 0 0.0095447 0 0.0354588 0 0.176 0 0.148 0 0.175 0 0.132 0
226 0 0.0000010 0 0.0000166 0 0.0015692 0 0.106 0 0.050 0 0.086 0 0.043 0
227 1 0.0074582 0 0.0083523 0 0.0120029 0 0.132 0 0.090 0 0.145 0 0.074 0
228 0 0.0515070 0 0.0978347 0 0.1826754 0 0.091 0 0.086 0 0.105 0 0.088 0
229 0 0.0000074 0 0.0000225 0 0.0024132 0 0.044 0 0.017 0 0.043 0 0.005 0
230 0 0.1506474 0 0.0664623 0 0.0887669 0 0.138 0 0.112 0 0.136 0 0.125 0
231 0 0.0043754 0 0.0041100 0 0.0114024 0 0.168 0 0.099 0 0.161 0 0.069 0
232 1 0.0184221 0 0.0856748 0 0.1087341 0 0.160 0 0.114 0 0.155 0 0.122 0
233 0 0.1329671 0 0.1732015 0 0.3284085 0 0.251 0 0.260 0 0.299 0 0.248 0
234 0 0.0038942 0 0.0101933 0 0.0267540 0 0.127 0 0.056 0 0.112 0 0.057 0
235 0 0.0015977 0 0.0031964 0 0.0237950 0 0.135 0 0.105 0 0.126 0 0.096 0
236 1 0.9375414 1 0.7716365 1 0.5573393 1 0.672 1 0.681 1 0.659 1 0.674 1
237 1 0.4045536 0 0.3748456 0 0.2866753 0 0.427 0 0.328 0 0.450 0 0.358 0
238 1 0.0651663 0 0.0585678 0 0.0951399 0 0.171 0 0.162 0 0.140 0 0.157 0
239 0 0.1702881 0 0.2540788 0 0.2155688 0 0.193 0 0.185 0 0.189 0 0.179 0
240 0 0.0027520 0 0.0299722 0 0.0722285 0 0.067 0 0.028 0 0.054 0 0.028 0
241 0 0.0256653 0 0.0123587 0 0.0246886 0 0.152 0 0.063 0 0.156 0 0.043 0
242 0 0.1411291 0 0.1538589 0 0.2239611 0 0.094 0 0.085 0 0.092 0 0.080 0
243 1 0.9935442 1 0.9912663 1 0.8697647 1 0.592 1 0.566 1 0.572 1 0.569 1
244 1 0.1295196 0 0.4720120 0 0.4108341 0 0.266 0 0.197 0 0.267 0 0.185 0
245 0 0.0000055 0 0.0000319 0 0.0022875 0 0.067 0 0.050 0 0.058 0 0.044 0
246 0 0.0781145 0 0.0397458 0 0.0857240 0 0.239 0 0.200 0 0.239 0 0.191 0
247 0 0.0004727 0 0.0016672 0 0.0207261 0 0.100 0 0.060 0 0.101 0 0.068 0
248 0 0.0084655 0 0.0233624 0 0.0629486 0 0.148 0 0.110 0 0.154 0 0.116 0
249 0 0.0360742 0 0.0829544 0 0.1734112 0 0.237 0 0.164 0 0.222 0 0.159 0
250 0 0.0019687 0 0.0014588 0 0.0128687 0 0.166 0 0.049 0 0.137 0 0.065 0
251 0 0.0053327 0 0.0056912 0 0.0340091 0 0.070 0 0.085 0 0.071 0 0.073 0
252 0 0.7617952 1 0.7347657 1 0.5660679 1 0.405 0 0.412 0 0.415 0 0.431 0
253 1 0.0898013 0 0.2047139 0 0.1254809 0 0.353 0 0.359 0 0.362 0 0.345 0
254 0 0.1616641 0 0.2563610 0 0.3238161 0 0.309 0 0.308 0 0.318 0 0.348 0
255 1 0.0268019 0 0.1062677 0 0.1389169 0 0.060 0 0.051 0 0.065 0 0.057 0
256 0 0.0439199 0 0.0606521 0 0.0753423 0 0.059 0 0.026 0 0.035 0 0.038 0
257 0 0.0384008 0 0.0734482 0 0.0941656 0 0.082 0 0.101 0 0.078 0 0.101 0
258 0 0.0561820 0 0.0240420 0 0.0754966 0 0.272 0 0.265 0 0.242 0 0.245 0
259 1 0.0066349 0 0.0192975 0 0.0499218 0 0.107 0 0.069 0 0.103 0 0.066 0
260 1 0.4405608 0 0.4056042 0 0.1675301 0 0.330 0 0.313 0 0.357 0 0.298 0
261 0 0.0006148 0 0.0028705 0 0.0256852 0 0.112 0 0.100 0 0.121 0 0.099 0
262 0 0.1642043 0 0.2982706 0 0.3518394 0 0.327 0 0.198 0 0.279 0 0.205 0
263 0 0.0086549 0 0.1306591 0 0.1391573 0 0.103 0 0.091 0 0.091 0 0.081 0
264 0 0.7753201 1 0.6576590 1 0.3917544 0 0.368 0 0.322 0 0.332 0 0.353 0
265 0 0.0578950 0 0.0265818 0 0.0651920 0 0.154 0 0.081 0 0.161 0 0.074 0
266 0 0.0779777 0 0.1534118 0 0.1638349 0 0.168 0 0.100 0 0.141 0 0.100 0
267 0 0.0007552 0 0.0006586 0 0.0075086 0 0.102 0 0.016 0 0.093 0 0.016 0
268 0 0.0000527 0 0.0007938 0 0.0107259 0 0.143 0 0.032 0 0.135 0 0.032 0
269 0 0.0170257 0 0.0053237 0 0.0254372 0 0.166 0 0.124 0 0.155 0 0.135 0
270 0 0.0003457 0 0.0017817 0 0.0182937 0 0.066 0 0.047 0 0.080 0 0.067 0
271 0 0.0090884 0 0.0109081 0 0.0579455 0 0.112 0 0.083 0 0.129 0 0.090 0
272 0 0.0030082 0 0.0125604 0 0.0590461 0 0.172 0 0.180 0 0.202 0 0.195 0
273 0 0.0001095 0 0.0003213 0 0.0118005 0 0.030 0 0.022 0 0.037 0 0.019 0
274 0 0.0012203 0 0.0034233 0 0.0264937 0 0.121 0 0.092 0 0.126 0 0.091 0
275 0 0.0663109 0 0.0427113 0 0.1271579 0 0.171 0 0.125 0 0.140 0 0.138 0
276 0 0.0058593 0 0.0072033 0 0.0427159 0 0.261 0 0.218 0 0.281 0 0.205 0
277 0 0.1905733 0 0.1634716 0 0.2493260 0 0.401 0 0.395 0 0.405 0 0.406 0
278 0 0.0004218 0 0.0007031 0 0.0034163 0 0.182 0 0.099 0 0.189 0 0.091 0
279 1 0.9763644 1 0.9585561 1 0.6499532 1 0.405 0 0.393 0 0.400 0 0.423 0
280 0 0.0002421 0 0.0004622 0 0.0039789 0 0.148 0 0.078 0 0.160 0 0.077 0
281 0 0.0048949 0 0.0128794 0 0.0639464 0 0.140 0 0.056 0 0.133 0 0.062 0
282 0 0.0190399 0 0.0092756 0 0.0106796 0 0.178 0 0.059 0 0.175 0 0.066 0
283 0 0.6394043 1 0.7421519 1 0.5425470 1 0.448 0 0.435 0 0.422 0 0.457 0
284 0 0.0413549 0 0.1251049 0 0.2116007 0 0.156 0 0.149 0 0.164 0 0.156 0
285 0 0.2725283 0 0.1550718 0 0.0809102 0 0.160 0 0.061 0 0.171 0 0.046 0
286 0 0.1620246 0 0.0937113 0 0.0655731 0 0.172 0 0.120 0 0.129 0 0.096 0
287 1 0.0002529 0 0.0006681 0 0.0157892 0 0.014 0 0.005 0 0.011 0 0.006 0
288 0 0.0120464 0 0.0241450 0 0.0445099 0 0.182 0 0.102 0 0.176 0 0.084 0
289 1 0.7787311 1 0.4977424 0 0.4436588 0 0.476 0 0.464 0 0.438 0 0.434 0
290 0 0.0072036 0 0.0138191 0 0.0746379 0 0.092 0 0.051 0 0.099 0 0.060 0
291 0 0.0010917 0 0.0023725 0 0.0212708 0 0.184 0 0.124 0 0.179 0 0.139 0
292 0 0.0001563 0 0.0003600 0 0.0082463 0 0.135 0 0.032 0 0.134 0 0.034 0
293 0 0.0036373 0 0.0058607 0 0.0257565 0 0.059 0 0.028 0 0.064 0 0.036 0
294 0 0.0129333 0 0.0082475 0 0.0411094 0 0.052 0 0.018 0 0.057 0 0.016 0
295 0 0.0014167 0 0.0030087 0 0.0268358 0 0.031 0 0.009 0 0.032 0 0.008 0
296 0 0.0030184 0 0.0065308 0 0.0372915 0 0.332 0 0.285 0 0.326 0 0.267 0
297 1 0.2879939 0 0.0656793 0 0.0927326 0 0.469 0 0.503 1 0.498 0 0.504 1
298 0 0.0412993 0 0.0143946 0 0.0528896 0 0.205 0 0.173 0 0.209 0 0.173 0
299 1 0.6097509 1 0.5898293 1 0.4375606 0 0.403 0 0.315 0 0.392 0 0.297 0
300 0 0.0031829 0 0.0021985 0 0.0173159 0 0.103 0 0.044 0 0.078 0 0.034 0
301 0 0.0295631 0 0.0346101 0 0.0593214 0 0.122 0 0.109 0 0.152 0 0.093 0
302 1 0.7236389 1 0.6634866 1 0.3159985 0 0.435 0 0.467 0 0.439 0 0.454 0
303 0 0.0045371 0 0.0836413 0 0.0714527 0 0.252 0 0.195 0 0.249 0 0.186 0
304 0 0.0009521 0 0.0098941 0 0.0420243 0 0.255 0 0.167 0 0.243 0 0.174 0
305 1 0.3687107 0 0.2923503 0 0.2304483 0 0.268 0 0.217 0 0.258 0 0.214 0
306 0 0.0001028 0 0.0003747 0 0.0078390 0 0.132 0 0.041 0 0.134 0 0.032 0
307 0 0.3873544 0 0.1253572 0 0.3187728 0 0.439 0 0.413 0 0.414 0 0.412 0
308 1 0.8273117 1 0.8263663 1 0.6284362 1 0.454 0 0.427 0 0.472 0 0.438 0
309 1 0.6916826 1 0.6861170 1 0.4827902 0 0.316 0 0.259 0 0.275 0 0.265 0
310 0 0.0253008 0 0.0647111 0 0.1127145 0 0.229 0 0.152 0 0.232 0 0.147 0
311 0 0.0136116 0 0.0089816 0 0.0322462 0 0.107 0 0.097 0 0.129 0 0.092 0
312 0 0.0160134 0 0.0415754 0 0.1283516 0 0.136 0 0.135 0 0.153 0 0.141 0
313 1 0.3342967 0 0.4326685 0 0.4787350 0 0.599 1 0.566 1 0.586 1 0.614 1
314 0 0.0359444 0 0.0430882 0 0.1533923 0 0.073 0 0.058 0 0.064 0 0.058 0
315 0 0.0043276 0 0.0107379 0 0.0579866 0 0.089 0 0.056 0 0.102 0 0.058 0
316 0 0.0008598 0 0.0026969 0 0.0218097 0 0.038 0 0.009 0 0.040 0 0.012 0
317 0 0.0071281 0 0.0133885 0 0.0540520 0 0.136 0 0.056 0 0.146 0 0.064 0
318 0 0.0385903 0 0.0461769 0 0.0754486 0 0.144 0 0.071 0 0.141 0 0.068 0
319 0 0.0276305 0 0.0584266 0 0.0950141 0 0.113 0 0.087 0 0.120 0 0.084 0
320 0 0.0000075 0 0.0000128 0 0.0014759 0 0.036 0 0.001 0 0.034 0 0.005 0
321 1 0.3870047 0 0.2451435 0 0.2151350 0 0.275 0 0.207 0 0.272 0 0.183 0
322 0 0.0029417 0 0.0095101 0 0.0522592 0 0.156 0 0.060 0 0.137 0 0.067 0
323 1 0.7822285 1 0.8555540 1 0.5147541 1 0.435 0 0.464 0 0.460 0 0.436 0
324 0 0.0089589 0 0.0124964 0 0.0616423 0 0.170 0 0.123 0 0.166 0 0.123 0
325 0 0.0170849 0 0.0179949 0 0.1342629 0 0.243 0 0.183 0 0.210 0 0.211 0
326 0 0.0055259 0 0.0060352 0 0.0501381 0 0.070 0 0.032 0 0.046 0 0.049 0
327 1 0.0017862 0 0.0027621 0 0.0297746 0 0.231 0 0.163 0 0.226 0 0.133 0
328 0 0.0143002 0 0.0621361 0 0.1205186 0 0.038 0 0.025 0 0.051 0 0.034 0
329 0 0.0024926 0 0.0127504 0 0.0494858 0 0.132 0 0.073 0 0.122 0 0.077 0
330 0 0.0083335 0 0.1106269 0 0.1549034 0 0.087 0 0.083 0 0.092 0 0.090 0
331 0 0.0059159 0 0.0056931 0 0.0275003 0 0.098 0 0.068 0 0.092 0 0.062 0
332 0 0.0102076 0 0.0477751 0 0.0702498 0 0.192 0 0.156 0 0.200 0 0.128 0
333 0 0.1119398 0 0.1250779 0 0.2642530 0 0.430 0 0.414 0 0.450 0 0.440 0
334 0 0.0114611 0 0.0143167 0 0.0503824 0 0.220 0 0.120 0 0.215 0 0.108 0
335 0 0.0487797 0 0.0303058 0 0.0830845 0 0.146 0 0.096 0 0.163 0 0.110 0
336 0 0.0630771 0 0.0497466 0 0.0682846 0 0.108 0 0.104 0 0.132 0 0.127 0
337 0 0.0000992 0 0.0007246 0 0.0063580 0 0.144 0 0.012 0 0.161 0 0.022 0
338 0 0.0650918 0 0.0533453 0 0.1113534 0 0.076 0 0.064 0 0.105 0 0.049 0
339 0 0.0040542 0 0.0112457 0 0.0502029 0 0.047 0 0.040 0 0.061 0 0.035 0
340 0 0.0056036 0 0.0131826 0 0.0862792 0 0.056 0 0.016 0 0.046 0 0.022 0
341 0 0.3024282 0 0.4394254 0 0.4772447 0 0.253 0 0.237 0 0.286 0 0.225 0
342 0 0.4537504 0 0.4552905 0 0.3654867 0 0.112 0 0.097 0 0.097 0 0.104 0
343 0 0.0062782 0 0.0121275 0 0.0309235 0 0.106 0 0.006 0 0.107 0 0.013 0
344 0 0.0000827 0 0.0005784 0 0.0122198 0 0.115 0 0.059 0 0.137 0 0.073 0
345 0 0.1474822 0 0.2198360 0 0.2216298 0 0.170 0 0.147 0 0.163 0 0.161 0
346 1 0.9865033 1 0.9814769 1 0.8084308 1 0.658 1 0.653 1 0.676 1 0.657 1
347 0 0.0002680 0 0.0010143 0 0.0187046 0 0.044 0 0.009 0 0.043 0 0.005 0
348 0 0.0129712 0 0.0135429 0 0.0215331 0 0.132 0 0.035 0 0.122 0 0.026 0
349 1 0.0048632 0 0.0193661 0 0.0492586 0 0.221 0 0.100 0 0.219 0 0.115 0
350 0 0.0116420 0 0.0463087 0 0.0521836 0 0.172 0 0.116 0 0.156 0 0.093 0
351 0 0.9875202 1 0.9379875 1 0.7145609 1 0.553 1 0.591 1 0.539 1 0.562 1
352 0 0.3794611 0 0.5469061 1 0.2761320 0 0.243 0 0.186 0 0.239 0 0.193 0
353 0 0.0002416 0 0.0003731 0 0.0110078 0 0.070 0 0.052 0 0.074 0 0.040 0
354 0 0.0016854 0 0.0124215 0 0.0479366 0 0.051 0 0.015 0 0.038 0 0.019 0
355 0 0.0003788 0 0.0004168 0 0.0071553 0 0.058 0 0.017 0 0.055 0 0.019 0
356 0 0.0005569 0 0.0012339 0 0.0088800 0 0.013 0 0.005 0 0.007 0 0.003 0
357 0 0.0003318 0 0.0007113 0 0.0195470 0 0.055 0 0.040 0 0.045 0 0.033 0
358 0 0.0000796 0 0.0002305 0 0.0085847 0 0.042 0 0.010 0 0.042 0 0.017 0
359 0 0.4062016 0 0.5313807 1 0.4140644 0 0.212 0 0.192 0 0.211 0 0.177 0
360 0 0.1368100 0 0.3148454 0 0.2357519 0 0.164 0 0.176 0 0.163 0 0.158 0
361 0 0.0001769 0 0.0002767 0 0.0091177 0 0.035 0 0.020 0 0.030 0 0.014 0
362 0 0.0189582 0 0.0160666 0 0.0964778 0 0.420 0 0.350 0 0.414 0 0.340 0
363 0 0.1218905 0 0.1411406 0 0.1380381 0 0.391 0 0.345 0 0.403 0 0.358 0
364 0 0.0034809 0 0.0080162 0 0.0625897 0 0.135 0 0.101 0 0.105 0 0.102 0
365 0 0.0503732 0 0.0231125 0 0.0879836 0 0.335 0 0.291 0 0.325 0 0.326 0
366 1 0.4379206 0 0.4971121 0 0.4206947 0 0.442 0 0.463 0 0.467 0 0.485 0
367 0 0.0008674 0 0.0040601 0 0.0357247 0 0.185 0 0.120 0 0.180 0 0.095 0
368 1 0.0187812 0 0.0317394 0 0.0988180 0 0.237 0 0.162 0 0.224 0 0.148 0
369 0 0.0049546 0 0.0070580 0 0.0183464 0 0.127 0 0.092 0 0.108 0 0.080 0
370 0 0.0049716 0 0.0087068 0 0.0485143 0 0.046 0 0.022 0 0.053 0 0.034 0
371 0 0.8972721 1 0.9127597 1 0.7727327 1 0.386 0 0.364 0 0.380 0 0.354 0
372 0 0.0543964 0 0.0795477 0 0.1288612 0 0.143 0 0.159 0 0.166 0 0.150 0
373 0 0.0387480 0 0.0480707 0 0.1746278 0 0.109 0 0.089 0 0.110 0 0.093 0
374 0 0.0016205 0 0.0032786 0 0.0262977 0 0.040 0 0.008 0 0.032 0 0.018 0
375 0 0.0000738 0 0.0002568 0 0.0090467 0 0.023 0 0.013 0 0.020 0 0.006 0
376 0 0.0013806 0 0.0021297 0 0.0408112 0 0.103 0 0.039 0 0.107 0 0.044 0
377 1 0.5809632 1 0.1298864 0 0.2085647 0 0.160 0 0.175 0 0.209 0 0.176 0
378 0 0.0985561 0 0.1689119 0 0.1512088 0 0.276 0 0.218 0 0.261 0 0.239 0
379 0 0.1657761 0 0.0919988 0 0.1078083 0 0.181 0 0.177 0 0.193 0 0.179 0
380 0 0.0064923 0 0.0099261 0 0.0196963 0 0.058 0 0.029 0 0.063 0 0.041 0
381 0 0.0434580 0 0.0696445 0 0.1164609 0 0.156 0 0.076 0 0.162 0 0.091 0
382 0 0.0753815 0 0.0455306 0 0.1198244 0 0.320 0 0.311 0 0.300 0 0.309 0
383 0 0.0054673 0 0.0150402 0 0.0584995 0 0.108 0 0.077 0 0.114 0 0.069 0
384 0 0.0036745 0 0.0075175 0 0.0561052 0 0.314 0 0.282 0 0.295 0 0.309 0
385 1 0.8804426 1 0.8095081 1 0.6404735 1 0.550 1 0.525 1 0.533 1 0.551 1
386 0 0.0251341 0 0.1173158 0 0.1651129 0 0.113 0 0.074 0 0.142 0 0.067 0
387 0 0.0027132 0 0.0073188 0 0.0294084 0 0.067 0 0.043 0 0.046 0 0.032 0
388 1 0.9817922 1 0.9546164 1 0.7522236 1 0.434 0 0.423 0 0.419 0 0.415 0
389 0 0.0048077 0 0.0069992 0 0.0365350 0 0.056 0 0.024 0 0.049 0 0.022 0
390 1 0.2893101 0 0.5313151 1 0.4680589 0 0.492 0 0.499 0 0.514 1 0.537 1
391 0 0.0470439 0 0.0843173 0 0.1249616 0 0.147 0 0.102 0 0.149 0 0.107 0
392 1 0.0020593 0 0.0084588 0 0.0467655 0 0.068 0 0.041 0 0.053 0 0.050 0
393 1 0.0922388 0 0.3096804 0 0.2715704 0 0.391 0 0.377 0 0.396 0 0.365 0
394 1 0.0012029 0 0.0062440 0 0.0213585 0 0.071 0 0.010 0 0.049 0 0.016 0
395 0 0.0014272 0 0.0011913 0 0.0121661 0 0.062 0 0.027 0 0.071 0 0.032 0
396 0 0.0128770 0 0.0280927 0 0.0757918 0 0.089 0 0.059 0 0.096 0 0.070 0
397 1 0.0177932 0 0.0102105 0 0.0529933 0 0.414 0 0.378 0 0.386 0 0.365 0
398 1 0.9948978 1 0.8587088 1 0.6437008 1 0.473 0 0.518 1 0.489 0 0.511 1
399 0 0.1107487 0 0.1112286 0 0.1258211 0 0.110 0 0.097 0 0.131 0 0.089 0
400 0 0.0006625 0 0.0021320 0 0.0168967 0 0.089 0 0.094 0 0.091 0 0.093 0
401 0 0.1979463 0 0.1231776 0 0.1169248 0 0.192 0 0.212 0 0.197 0 0.184 0
402 0 0.0000176 0 0.0000155 0 0.0018284 0 0.130 0 0.055 0 0.120 0 0.043 0
403 0 0.0190374 0 0.0608910 0 0.0793835 0 0.120 0 0.081 0 0.125 0 0.069 0
404 0 0.0331881 0 0.0190314 0 0.0346026 0 0.208 0 0.166 0 0.200 0 0.127 0
405 0 0.1149406 0 0.0826578 0 0.2235723 0 0.158 0 0.105 0 0.166 0 0.116 0
406 0 0.0025976 0 0.0034800 0 0.0302775 0 0.087 0 0.047 0 0.085 0 0.047 0
407 0 0.0675733 0 0.5676680 1 0.4420326 0 0.167 0 0.176 0 0.162 0 0.189 0
408 0 0.0066697 0 0.0104739 0 0.0414407 0 0.060 0 0.019 0 0.069 0 0.017 0
409 0 0.3234854 0 0.0949732 0 0.1900134 0 0.202 0 0.154 0 0.201 0 0.161 0
410 0 0.0016931 0 0.0048759 0 0.0423037 0 0.233 0 0.235 0 0.247 0 0.232 0
411 1 0.5035777 1 0.4717216 0 0.3709962 0 0.264 0 0.244 0 0.247 0 0.256 0
412 0 0.1251573 0 0.2013830 0 0.1117740 0 0.200 0 0.190 0 0.226 0 0.181 0
413 0 0.0337413 0 0.3038647 0 0.2528104 0 0.248 0 0.217 0 0.266 0 0.217 0
414 0 0.0003892 0 0.0007819 0 0.0100029 0 0.158 0 0.046 0 0.154 0 0.056 0
415 0 0.0804113 0 0.0486219 0 0.0711035 0 0.127 0 0.063 0 0.138 0 0.066 0
416 0 0.0602379 0 0.1583113 0 0.0952658 0 0.093 0 0.065 0 0.091 0 0.062 0
417 0 0.0002368 0 0.0022071 0 0.0283572 0 0.132 0 0.120 0 0.125 0 0.112 0
418 0 0.0169605 0 0.0355325 0 0.0591055 0 0.244 0 0.211 0 0.266 0 0.209 0
419 0 0.0125479 0 0.0185922 0 0.0436857 0 0.096 0 0.052 0 0.105 0 0.046 0
420 0 0.0120720 0 0.1473026 0 0.2416239 0 0.295 0 0.250 0 0.287 0 0.244 0
421 0 0.0001443 0 0.0019544 0 0.0115942 0 0.039 0 0.029 0 0.041 0 0.028 0
422 0 0.1673270 0 0.0757798 0 0.1046299 0 0.346 0 0.381 0 0.372 0 0.375 0
423 0 0.0020409 0 0.0059440 0 0.0378728 0 0.119 0 0.092 0 0.116 0 0.079 0
424 0 0.0000156 0 0.0000450 0 0.0034086 0 0.046 0 0.004 0 0.042 0 0.003 0
425 0 0.4688507 0 0.5841083 1 0.2436172 0 0.135 0 0.030 0 0.146 0 0.028 0
426 0 0.0994804 0 0.2985674 0 0.2636824 0 0.208 0 0.181 0 0.205 0 0.193 0
427 0 0.4394713 0 0.3902367 0 0.2400195 0 0.257 0 0.234 0 0.282 0 0.222 0
428 0 0.0015534 0 0.0032903 0 0.0289365 0 0.071 0 0.032 0 0.062 0 0.034 0
429 0 0.7044950 1 0.7721093 1 0.5472882 1 0.625 1 0.615 1 0.608 1 0.604 1
430 0 0.0000312 0 0.0001290 0 0.0052183 0 0.081 0 0.045 0 0.083 0 0.049 0
431 0 0.0019083 0 0.0045630 0 0.0385711 0 0.086 0 0.034 0 0.077 0 0.046 0
432 0 0.0120290 0 0.0079796 0 0.0415151 0 0.066 0 0.026 0 0.074 0 0.024 0
433 0 0.2988495 0 0.2925479 0 0.3141300 0 0.274 0 0.223 0 0.257 0 0.251 0
434 1 0.0165202 0 0.0146160 0 0.0473238 0 0.096 0 0.043 0 0.076 0 0.034 0
435 0 0.0308433 0 0.0416157 0 0.0863887 0 0.118 0 0.073 0 0.108 0 0.081 0
436 0 0.0219957 0 0.0287299 0 0.0915299 0 0.081 0 0.062 0 0.080 0 0.069 0
437 0 0.1398414 0 0.1598856 0 0.2819705 0 0.109 0 0.102 0 0.090 0 0.108 0
438 0 0.0000436 0 0.0001129 0 0.0057065 0 0.058 0 0.029 0 0.047 0 0.031 0
439 0 0.0018365 0 0.0086006 0 0.0469397 0 0.079 0 0.062 0 0.081 0 0.071 0
440 0 0.0006884 0 0.0030133 0 0.0235452 0 0.038 0 0.012 0 0.034 0 0.014 0
441 0 0.0004364 0 0.0044121 0 0.0305393 0 0.104 0 0.032 0 0.109 0 0.046 0

Differences

Column

Number misclassified

# of misclassified instances - Logistic Regr:  57
# of misclassified instances - SLR(lambda_min):  64
# of misclassified instances - SLR(lambda_1se):  61
# of misclassified instances - RF_full_w/_outliers:  61
# of misclassified instances - RF_full_w/o_outliers:  59
# of misclassified instances - RF_reduced_w/_outliers:  60
# of misclassified instances - RF_reduced_w/o_outliers:  58
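
For reference, counts like these are produced by comparing each model's 0/1 predictions against the true test-set labels. A minimal sketch in R (`actual` and `pred_lr` are hypothetical stand-ins, not objects from this document's code):

```r
# toy stand-ins for the true labels and one model's 0/1 predictions
actual  <- c(0, 1, 0, 0, 1, 1, 0, 0)
pred_lr <- c(0, 1, 1, 0, 0, 1, 0, 0)

# a misclassification is any position where prediction and truth disagree
n_misclassified <- sum(pred_lr != actual)
n_misclassified
```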

Misclass Agreements (continuous)

Misclass Agreements (discrete)

Misclassified (All models agree)

  • This is a table of the test data instances that were misclassified. These specific instances were misclassified across all models applied.
# of the SAME misclassified instances that occur across ALL models:  43


index age attrition businesstravel dailyrate department distancefromhome education educationfield environmentsatisfaction gender hourlyrate jobinvolvement joblevel jobrole jobsatisfaction maritalstatus monthlyincome monthlyrate numcompaniesworked overtime percentsalaryhike performancerating relationshipsatisfaction stockoptionlevel totalworkingyears trainingtimeslastyear worklifebalance yearsatcompany yearsincurrentrole yearssincelastpromotion yearswithcurrmanager
1 37 1 travel_rarely 1373 research & development 2 2 other 4 male 92 2 1 laboratory technician 3 single 2090 2396 6 yes 15 3 2 0 7 3 3 0 0 0 0
10 24 1 travel_rarely 813 research & development 1 3 medical 2 male 61 3 1 research scientist 4 married 2293 3020 2 yes 16 3 1 1 6 2 2 2 0 2 0
33 51 1 travel_frequently 1150 research & development 8 4 life sciences 1 male 53 1 3 manufacturing director 4 single 10650 25150 2 no 15 3 4 0 18 2 3 4 2 0 3
36 32 1 travel_rarely 1033 research & development 9 3 medical 1 female 41 3 1 laboratory technician 1 single 4200 10224 7 no 22 4 1 0 10 2 4 5 4 0 4
55 38 1 travel_rarely 1180 research & development 29 1 medical 2 male 70 3 2 healthcare representative 1 married 6673 11354 7 yes 19 3 2 0 17 2 3 1 0 0 0
56 29 1 travel_rarely 121 sales 27 3 marketing 2 female 35 3 3 sales executive 4 married 7639 24525 1 no 22 4 4 3 10 3 2 10 4 1 9
57 32 1 travel_rarely 1045 sales 4 4 medical 4 male 32 1 3 sales executive 4 married 10400 25812 1 no 11 3 3 0 14 2 2 14 8 9 8
68 32 1 travel_rarely 515 research & development 1 3 life sciences 4 male 62 2 1 laboratory technician 3 single 3730 9571 0 yes 14 3 4 0 4 2 1 3 2 1 2
72 37 1 travel_frequently 504 research & development 10 3 medical 1 male 61 3 3 manufacturing director 3 divorced 10048 22573 6 no 11 3 2 2 17 5 3 1 0 0 0
79 47 1 non-travel 666 research & development 29 4 life sciences 1 male 88 3 3 manager 2 married 11849 10268 1 yes 12 3 4 1 10 2 2 10 7 9 9
131 31 1 travel_frequently 534 research & development 20 3 life sciences 1 male 66 3 3 healthcare representative 3 married 9824 22908 3 no 12 3 1 0 12 2 3 1 0 0 0
146 31 1 travel_rarely 1365 sales 13 4 medical 2 male 46 3 2 sales executive 1 divorced 4233 11512 2 no 17 3 3 0 9 2 1 3 1 1 2
168 33 1 travel_rarely 527 research & development 1 4 other 4 male 63 3 1 research scientist 4 single 2686 5207 1 yes 13 3 3 0 10 2 2 10 9 7 8
173 23 1 travel_rarely 1243 research & development 6 3 life sciences 3 male 63 4 1 laboratory technician 1 married 1601 3445 1 yes 21 4 3 2 1 2 3 0 0 0 0
177 58 1 travel_rarely 286 research & development 2 4 life sciences 4 male 31 3 5 research director 2 single 19246 25761 7 yes 12 3 4 0 40 2 3 31 15 13 8
182 55 1 travel_rarely 436 sales 2 1 medical 3 male 37 3 2 sales executive 4 single 5160 21519 4 no 16 3 3 0 12 3 2 9 7 7 3
203 41 1 travel_rarely 1085 research & development 2 4 life sciences 2 female 57 1 1 laboratory technician 4 divorced 2778 17725 4 yes 13 3 3 1 10 1 2 7 7 1 0
204 39 1 travel_rarely 1122 research & development 6 3 medical 4 male 70 3 1 laboratory technician 1 married 2404 4303 7 yes 21 4 4 0 8 2 1 2 2 2 2
220 35 1 travel_rarely 622 research & development 14 4 other 3 male 39 2 1 laboratory technician 2 divorced 3743 10074 1 yes 24 4 4 1 5 2 1 4 2 0 2
222 30 1 travel_frequently 109 research & development 5 3 medical 2 female 60 3 1 laboratory technician 2 single 2422 25725 0 no 17 3 1 0 4 3 3 3 2 1 2
227 36 1 travel_rarely 885 research & development 16 4 life sciences 3 female 43 4 1 laboratory technician 1 single 2743 8269 1 no 16 3 3 0 18 1 3 17 13 15 14
232 36 1 travel_rarely 660 research & development 15 3 other 1 male 81 3 2 laboratory technician 3 divorced 4834 7858 7 no 14 3 2 1 9 3 2 1 0 0 0
237 21 1 travel_rarely 1334 research & development 10 3 life sciences 3 female 36 2 1 laboratory technician 1 single 1416 17258 1 no 13 3 1 0 1 6 2 1 0 1 0
238 28 1 non-travel 1366 research & development 24 2 technical degree 2 male 72 2 3 healthcare representative 1 single 8722 12355 1 no 12 3 1 0 10 2 2 10 7 1 9
244 50 1 travel_frequently 959 sales 1 4 other 4 male 81 3 2 sales executive 3 single 4728 17251 3 yes 14 3 4 0 5 4 3 0 0 0 0
253 18 1 non-travel 247 research & development 8 1 medical 3 male 80 3 1 laboratory technician 3 single 1904 13556 1 no 12 3 4 0 0 0 3 0 0 0 0
255 31 1 travel_frequently 874 research & development 15 3 medical 3 male 72 3 1 laboratory technician 3 married 2610 6233 1 no 12 3 3 1 2 5 2 2 2 2 2
259 29 1 travel_rarely 408 sales 23 1 life sciences 4 female 45 2 3 sales executive 1 married 7336 11162 1 no 13 3 1 1 11 3 1 11 8 3 10
260 42 1 travel_frequently 481 sales 12 3 life sciences 3 male 44 3 4 sales executive 1 single 13758 2447 0 yes 12 3 2 0 22 2 2 21 9 13 14
287 32 1 travel_rarely 1089 research & development 7 2 life sciences 4 male 79 3 2 laboratory technician 3 married 4883 22845 1 no 18 3 1 1 10 3 3 10 4 1 1
305 49 1 travel_frequently 1475 research & development 28 2 life sciences 1 male 97 2 2 laboratory technician 1 single 4284 22710 3 no 20 4 1 0 20 2 3 4 3 1 3
321 28 1 travel_frequently 1496 sales 1 3 technical degree 1 male 92 3 1 sales representative 3 married 2909 15747 3 no 15 3 4 1 5 3 4 3 2 1 2
327 40 1 travel_rarely 676 research & development 9 4 life sciences 4 male 86 3 1 laboratory technician 1 single 2018 21831 3 no 14 3 2 0 15 3 1 5 4 1 0
349 35 1 travel_rarely 737 sales 10 3 medical 4 male 55 2 3 sales executive 1 married 10306 21530 9 no 17 3 3 0 15 3 3 13 12 6 0
351 24 0 travel_frequently 567 research & development 2 1 technical degree 1 female 32 3 1 research scientist 4 single 3760 17218 1 yes 13 3 3 0 6 2 3 6 3 1 3
366 23 1 travel_rarely 1320 research & development 8 1 medical 4 male 93 2 1 laboratory technician 3 single 3989 20586 1 yes 11 3 1 0 5 2 3 5 4 1 2
368 32 1 travel_rarely 1259 research & development 2 4 life sciences 4 male 95 3 1 laboratory technician 2 single 1393 24852 1 no 12 3 1 0 1 2 3 1 0 0 0
392 37 1 travel_rarely 370 research & development 10 4 medical 4 male 58 3 2 manufacturing director 1 single 4213 4992 1 no 15 3 2 0 10 4 1 10 3 0 8
393 26 1 travel_rarely 920 human resources 20 2 medical 4 female 69 3 1 human resources 2 married 2148 6889 0 yes 11 3 3 0 6 3 3 5 1 1 4
394 46 1 travel_rarely 261 research & development 21 2 medical 4 female 66 3 2 healthcare representative 2 married 8926 10842 4 no 22 4 4 1 13 2 4 9 7 3 7
397 31 1 travel_rarely 359 human resources 18 5 human resources 4 male 89 4 1 human resources 1 married 2956 21495 0 no 17 3 3 0 2 4 3 1 0 0 0
429 21 0 travel_rarely 501 sales 5 1 medical 3 male 58 3 1 sales representative 1 single 2380 25479 1 yes 11 3 4 0 2 6 3 2 2 1 2
434 50 1 travel_frequently 878 sales 1 4 life sciences 2 male 94 3 2 sales executive 3 divorced 6728 14255 7 no 12 3 4 2 12 3 3 6 3 0 1

Observations & Notes

Observations & Notes

  • The reduced Random Forest model on data without outliers has the best AUC on the test set, and the best classification rate among the random forest models.

  • The logistic regression model built on data without outliers has the best correct classification rate on the test set.

  • All random forest models had better AUC values than the logistic regression model, but each also had a lower correct classification rate than the logistic regression model. However, the differences were small (classification-rate differences of approximately 0.009–0.011 and AUC differences of approximately 0.0023–0.0091).

  • After doing some reading online, disagreements between AUC and correct classification rate may occur because of an unbalanced data set and/or the use of an accuracy threshold of 0.5 (which was used in this project). To troubleshoot, further examination is needed of the ROC curves, the threshold values used, possibly the predicted probabilities, and/or other performance measures (e.g., sensitivity, specificity, etc.). I suspect the mismatch between the best AUC and the highest classification rate is most likely related to the data set being unbalanced.

  • Upon inspecting the ROC curves, we note that the area involving the greatest disagreement between the logistic regression model and the random forest models exists where \(1-specificity\) is between \((0.25, 0.5)\).

  • Given the previous work, if the objective is predicting attrition then applying the reduced/sparse random forest model on data that does not contain outliers and/or correlated predictor variables appears to be the best method to use (of those explored here).

  • For the purpose of this project/exercise, I want to explore how various predictor variables affect the odds or probability of \(attrition = 1 (yes)\). To do so, I am choosing the logistic regression model to gain further insights from the data. This model is chosen because:
    • it’s more easily interpreted for inference purposes
    • it has the best correct classification rate
    • its AUC difference is less than 0.01 compared to the other models
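
On the ROC/threshold questions raised in the notes above, the pROC package offers one way to inspect the AUC and alternative classification thresholds. A sketch with toy stand-ins for the test-set outcomes and a model's predicted probabilities (the real objects from this analysis would be substituted in):

```r
library(pROC)  # assumes the pROC package is installed

# toy stand-ins for test-set outcomes and predicted probabilities
actual <- c(0, 0, 1, 1, 1, 0, 1, 0)
probs  <- c(0.10, 0.40, 0.35, 0.80, 0.70, 0.20, 0.90, 0.60)

roc_obj <- roc(actual, probs)
auc(roc_obj)

# a threshold other than the default 0.5, chosen to maximize
# sensitivity + specificity (Youden's J)
coords(roc_obj, "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))
```

Comparing the suggested threshold to 0.5 may show whether the unbalanced response is pulling the accuracy-optimal cutoff away from the default.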

Chosen Model

Chosen model

\[\begin{align} logit[P(attrition = 1 (Yes))] = &11.86865 - 0.36314age + 0.00426age^2 + \beta_2businesstravel \\ &- 0.00064dailyrate + 0.06069distancefromhome + \\ &\beta_7educationfield - 0.90788environmentsatisfaction \\ &- 1.19419jobinvolvement - 0.42537jobsatisfaction + \\ &\beta_{15}maritalstatus + 0.29720numcompaniesworked + \\ &\beta_{19}overtime - 0.20686totalworkingyears \\ &- 0.29303trainingtimeslastyear \\ &- 0.20462yearsincurrentrole + \\ &0.26748yearssincelastpromotion \end{align}\]

where \[\beta_2businesstravel = \begin{cases} 0, \quad for \enspace businesstravel = non-travel(1)\\ & \\ 2.44736, \quad for \enspace businesstravel = travel \enspace rarely(2)\\ & \\ 4.06929, \quad for \enspace businesstravel = travel \enspace frequently(3) \end{cases} \]


\[\beta_7educationfield = \begin{cases} 0, \quad for \enspace educationfield = human \enspace resources(1)\\ & \\ -1.76682, \quad for \enspace educationfield = life \enspace sciences(2)\\ & \\ -0.53888, \quad for \enspace educationfield = marketing(3)\\ & \\ -2.07068, \quad for \enspace educationfield = medical(4)\\ & \\ -3.11794, \quad for \enspace educationfield = other(5)\\ & \\ -0.13154, \quad for \enspace educationfield = technical \enspace degree(6)\\ \end{cases} \]


\[\beta_{15}maritalstatus = \begin{cases} 0, \quad for \enspace maritalstatus = single(1)\\ & \\ -1.60791, \quad for \enspace maritalstatus = married(2)\\ & \\ -1.68362, \quad for \enspace maritalstatus = divorced(3)\\ \end{cases} \]


\[\beta_{19}overtime = \begin{cases} 0, \quad for \enspace overtime = no(1)\\ & \\ 3.13669, \quad for \enspace overtime = yes(2)\\ \end{cases} \]
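
To turn the log-odds given by the model above into a probability, the linear predictor is passed through the inverse logit. In practice this is `predict(fit, newdata, type = "response")` on the fitted `glm` object; the sketch below instead isolates the \(overtime\) contrast by hand, using a hypothetical baseline value for the linear predictor:

```r
# hypothetical log-odds (linear predictor) for an employee with overtime = no
eta_no  <- -2.0

# add the overtime = yes coefficient from the model above
eta_yes <- eta_no + 3.13669

# inverse logit: P(attrition = 1) = e^eta / (1 + e^eta)
plogis(eta_no)
plogis(eta_yes)
```

Whatever the baseline, adding 3.13669 to the log-odds multiplies the odds of \(attrition = 1\) by \(e^{3.13669} \approx 23\).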

Searchable Data Table

Interpretations

Column

Continuous/numerical variable (scrollable)

  • For every one-year increase in \(age\) the estimated odds of \(attrition = Yes\) changes by a multiplicative factor of \(e^{-0.36314} = 0.6955\), meaning that there’s a 30.5% decrease in the odds of \(attrition = Yes\), holding all other variables fixed. Furthermore, since there is a quadratic (squared) term for \(age\), the effect of \(age\) is not a constant linear slope. Instead, the slope for \(age\) changes with each additional year of age: the effect of age on the estimated odds of \(attrition = Yes\) initially decreases, is minimized at age 43, and increases after age 43, holding all other variables fixed. See the following plot.

  • For every one-dollar/day increase in \(dailyrate\) the estimated odds of \(attrition = Yes\) changes by a multiplicative factor of \(e^{-0.00064} = 0.9994\), meaning that there’s a 0.06% decrease in the odds of \(attrition = Yes\), holding all other variables fixed.

  • For every 1-mile increase in \(distancefromhome\) the estimated odds of \(attrition = Yes\) increases by a multiplicative factor of \(e^{0.06069} = 1.0626\), meaning that there’s a 6.26% increase in the odds of \(attrition = Yes\), holding all other variables fixed.

  • For every 1-unit increase in \(numcompaniesworked\) the estimated odds of \(attrition = Yes\) increases by a multiplicative factor of \(e^{0.29720} = 1.3461\), meaning that there’s a 34.6% increase in the odds of \(attrition = Yes\), holding all other variables fixed.

  • For every 1-unit increase in \(trainingtimeslastyear\) the estimated odds of \(attrition = Yes\) changes by a multiplicative factor of \(e^{-0.29303} = 0.7460\), meaning that there’s a 25.4% decrease in the odds of \(attrition = Yes\), holding all other variables fixed.

  • For every 1-year increase in \(yearsincurrentrole\) the estimated odds of \(attrition = Yes\) changes by a multiplicative factor of \(e^{-0.20462} = 0.8150\), meaning that there’s an 18.5% decrease in the odds of \(attrition = Yes\), holding all other variables fixed.

  • For every 1-year increase in \(yearssincelastpromotion\) the estimated odds of \(attrition = Yes\) increases by a multiplicative factor of \(e^{0.26748} = 1.3067\), meaning that there’s a 30.7% increase in the odds of \(attrition = Yes\), holding all other variables fixed.
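
The age at which the quadratic \(age\) effect bottoms out (cited as age 43 above) follows from setting the derivative of the model's age terms to zero:

\[
\frac{\partial}{\partial\,age}\left(-0.36314\,age + 0.00426\,age^2\right) = -0.36314 + 2(0.00426)\,age = 0
\quad\Rightarrow\quad
age = \frac{0.36314}{2(0.00426)} \approx 42.6
\]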

Categorical variables (scrollable)

  • For every 1-unit increase in \(environmentsatisfaction\) rating the estimated odds of \(attrition = Yes\) changes by a multiplicative factor of \(e^{-0.90788} = 0.4034\), meaning that there’s a 59.7% decrease in the odds of \(attrition = Yes\) from the previous rating level, holding all other variables fixed.

  • For every 1-unit increase in \(jobinvolvement\) rating the estimated odds of \(attrition = Yes\) changes by a multiplicative factor of \(e^{-1.19419} = 0.3029\), meaning that there’s a 69.7% decrease in the odds of \(attrition = Yes\) from the previous rating level, holding all other variables fixed.

  • For every 1-unit increase in \(jobsatisfaction\) rating the estimated odds of \(attrition = Yes\) changes by a multiplicative factor of \(e^{-0.42537} = 0.6535\), meaning that there’s a 34.7% decrease in the odds of \(attrition = Yes\) from the previous rating level, holding all other variables fixed.

  • For \(businesstravel = travel rarely\), the estimated odds of \(attrition = Yes\) is \(e^{2.44736} = 11.5578\) times the estimated odds for \(businesstravel = non-travel\). The estimated odds is 1056% greater for the \(businesstravel = travel rarely\) group.

  • For \(businesstravel = travel frequently\), the estimated odds of \(attrition = Yes\) is \(e^{4.06929} = 58.5154\) times the estimated odds for \(businesstravel = non-travel\). The estimated odds is 5752% greater for the \(businesstravel = travel frequently\) group.

  • For \(businesstravel = travel frequently\), the estimated odds of \(attrition = Yes\) is \(e^{4.06929-2.44736} = 5.0629\) times the estimated odds for \(businesstravel = travel rarely\). The estimated odds is 406% greater for the \(businesstravel = travel frequently\) group.

  • For \(educationfield = life sciences\), the estimated odds of \(attrition = Yes\) is \(e^{-1.76682} = 0.1709\) times the estimated odds for \(educationfield = human resources\). The estimated odds is 82.1% lower for the \(educationfield = life sciences\) group.

  • For \(educationfield = marketing\), the estimated odds of \(attrition = Yes\) is \(e^{-0.53888} = 0.5834\) times the estimated odds for \(educationfield = human resources\). The estimated odds is 46.2% lower for the \(educationfield = marketing\) group.

  • For \(educationfield = medical\), the estimated odds of \(attrition = Yes\) is \(e^{-2.07068} = 0.1261\) times the estimated odds for \(educationfield = human resources\). The estimated odds is 87.4% lower for the \(educationfield = medical\) group.

  • For \(educationfield = other\), the estimated odds of \(attrition = Yes\) is \(e^{-3.11794} = 0.0442\) times the estimated odds for \(educationfield = human resources\). The estimated odds is 95.6% lower for the \(educationfield = other\) group.

  • For \(educationfield = technical degree\), the estimated odds of \(attrition = Yes\) is \(e^{-0.13154} = 0.8767\) times the estimated odds for \(educationfield = human resources\). The estimated odds is 12.3% lower for the \(educationfield = technical degree\) group.

  • For \(maritalstatus = married\), the estimated odds of \(attrition = Yes\) is \(e^{-1.60791} = 0.2003\) times the estimated odds for \(maritalstatus = single\). The estimated odds is 80% lower for the \(maritalstatus = married\) group.

  • For \(maritalstatus = divorced\), the estimated odds of \(attrition = Yes\) is \(e^{-1.68362} = 0.1857\) times the estimated odds for \(maritalstatus = single\). The estimated odds is 81.4% lower for the \(maritalstatus = divorced\) group.

  • For \(overtime = yes\), the estimated odds of \(attrition = Yes\) is \(e^{3.13669} = 23.0275\) times the estimated odds for \(overtime = no\). The estimated odds is 2202% greater for the \(overtime = yes\) group.
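
The multiplicative factors and percent changes quoted in the two lists above are exponentiated model coefficients; a quick R check using three of the coefficients reported in this document:

```r
# selected coefficients from the chosen logistic regression model
coefs <- c(numcompaniesworked    =  0.29720,
           trainingtimeslastyear = -0.29303,
           overtime_yes          =  3.13669)

odds_multiplier <- exp(coefs)             # e^beta
pct_change      <- 100 * (odds_multiplier - 1)

round(odds_multiplier, 4)
round(pct_change, 1)
```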

Column

Age Effect

Observations

Variables with no information value
* \(employeecount\) - only one unique value; each “1” represents a single employee
* \(over18\) - only one unique value; all employees are \(\geq\) 18 years old
* \(standardhours\) - only one unique value; each employee works a standard 80 hours over a two-week period
* \(employeenumber\) - value represents an indexing method to identify each employee

Variables not impacting the model & its outcome
* Independence tests during logistic regression modeling indicated that the following variables had no relationship with the response variable, so they were excluded from the model:
+ \(gender\)
+ \(relationshipsatisfaction\)
+ \(worklifebalance\)

Variables removed due to collinearity
* The following variables were removed during logistic regression modeling because they were highly correlated with other variables in the model:
+ \(department\)
+ \(joblevel\)
+ \(jobrole\)
+ \(monthlyincome\)
+ \(yearsatcompany\)

Categorical predictor combination contributing to highest estimated \(attrition = Yes\)
* \(businesstravel = travel frequently\)
* \(educationfield = human resources\)
* \(jobinvolvement = low\)
* \(jobsatisfaction = low\)
* \(maritalstatus = single\)
* \(overtime = yes\)

Categorical predictor combination contributing to lowest estimated \(attrition = Yes\)
* \(businesstravel = never\)
* \(educationfield = other\)
* \(jobinvolvement = very high\)
* \(jobsatisfaction = very high\)
* \(maritalstatus = divorced\)
* \(overtime = no\)

Variables affecting attrition the most
* \(numcompaniesworked\) and \(yearssincelastpromotion\) each lead to an increase in the odds of attrition as the variable value increases
* \(trainingtimeslastyear\), \(yearsincurrentrole\), \(environmentsatisfaction\), \(jobinvolvement\), and \(jobsatisfaction\) each lead to a notable decrease in the odds of attrition as the variable value increases

Post-modeling Exploration

Attrition By Department & Gender

Findings & recommendations

Column

Major Findings

  • The highest number of employees that attrited were in the Research and Development department followed by the Sales department.

  • In descending order, \(numcompaniesworked\) and \(yearssincelastpromotion\) each, individually, have the greatest increasing effect on the odds of attrition

  • In descending order, \(jobinvolvement\), \(environmentsatisfaction\), \(jobsatisfaction\), \(trainingtimeslastyear\), and \(yearsincurrentrole\) each, individually, have the greatest decreasing effect on the odds of attrition

  • Individually, the effect of \(age\) decreases odds of attrition every year from ages 18-42. At age 43 the effect of \(age\) is minimized. Afterwards, beginning at age 44, the effect of \(age\) increases odds of attrition.

  • The odds of attrition will be substantially lower for married or divorced employees than it is for single employees.

  • The odds of attrition will be lower for each educationfield category compared to those with an education field of human resources.

  • The odds of attrition will be greater for both frequent and rare business travelers compared to non-travelers.

Column

Recommendations

  • Focus on R&D department first, Sales department second

  • Based on data exploration and model findings, consider initially focusing efforts on employees who:

    • are single
    • are 18-30 years old
    • have worked for < 3 companies previously
    • had 0-3 training times last year
    • have < 4 years in current role
    • have < 3 years since last promotion
    • work overtime
    • travel rarely for business
    • have a life sciences, marketing, and/or medical education field

Potential strategy

  1. Look at employee placement first.
    • Should some employees move to a different department?
    • Would they be happier? More engaged?
    • Are they currently in the department/role that is an optimal fit?
  2. Provide adequate and appropriate training
    • Employees may not feel that they are getting enough of the right training
  3. Reduce and/or re-align travel according to job role & department
    • Some employees may need to travel more to fully accomplish various tasks
    • Other employees may feel that they travel too much
  4. Address overtime
    • Can overtime be reduced?
    • Are temporary or seasonal hires needed?
    • Can overly aggressive deadlines be extended? What race is trying to be won?
    • Eliminate redundant or unnecessary job process requirements
  • Aggregated satisfaction ratings (counts) may indicate some success with implemented changes.

  • Review status quarterly or semi-annually.

Other considerations

Column

Other helpful data

  • Include attrition categories - i.e., instead of grouping by \(attrition = yes\, or \, no\), consider expanding the number of categories/reasons for attrition, such as: quit, fire, resign, retire, death, medical, relocate, etc.

  • Clarify meanings of \(dailyrate\), \(hourlyrate\), \(monthlyrate\) and/or how they relate to \(monthlyincome\)

  • Should we assume that \(dailyrate\), \(hourlyrate\), and \(monthlyrate\) represent an employee’s salary? If so, shouldn’t they be consistent? (i.e., assuming an 8-hr workday, 8 × \(hourlyrate\) should equal \(dailyrate\), etc.)

  • Does \(monthlyincome\) represent employee salary before or after deducting taxes & contributions (i.e., income tax, Social Security, medical/vision/dental insurance, etc.)?

  • The amount of overtime (i.e., number of overtime hours worked, which day of the week overtime was worked, whether overtime was worked during/on a holiday and which one, etc.) may provide more insight than ‘yes’ or ‘no’ responses.

  • The type and amount of training received last year may be more informative and provide better insight (e.g., online, seminar, webinar, brown-bag, formal class, class at an outside formal institution [also online, blended, or traditional], etc.)

  • Exclusive of what the model indicates, compare data to relevant HR/employment requirements - is the data representative of meeting or not meeting certain state or federal employment guidelines/requirements? Diversity comes to mind. If certain requirements are not being met, then the fulfillment of those requirements could cause change in the model and what insights it leads to.

Column

Other things to try and/or explore

  • Look at a comparison of the misclassified instances from the test set vs. the instances with high leverage in the training set. Are there similarities or differences? Anything that might indicate what’s causing the misclassification?

  • GAM model - to try a smoothing, non-linear model

  • Incorporate/use SQL (via the \(sqldf\) package) to compare outlier instances vs. high-leverage instances found in the training set. How are they similar? How are they different?

  • Incorporate/use SQL to compare misclassified instances from the test set vs. the outlier and/or high-leverage instances in the training set. How are they similar? How are they different? This could be informative about why those instances in the test set were misclassified.

  • Discretize, or group, select predictor variables, such as \(age\), \(distancefromhome\), \(dailyrate\), \(hourlyrate\), and/or \(monthlyincome\).

  • Bootstrap or randomly sample instances from the data set to add additional instances/observations to the data set to balance the response variable
    • is this an acceptable practice?
    • how would the model change or be different (at least for logistic regression)?
  • Other possible models
    • AdaBoost
    • Neural net(s) - simple, CNN, RNN, etc.
    • Survival analysis
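
The resampling idea above can be sketched as simple random upsampling with replacement; the `train` data frame here is a toy stand-in for the actual training set:

```r
set.seed(42)

# toy stand-in for an unbalanced training set with a 0/1 response
train <- data.frame(attrition = c(rep(0, 8), rep(1, 2)),
                    x = rnorm(10))

minority <- train[train$attrition == 1, ]
majority <- train[train$attrition == 0, ]

# resample minority rows with replacement until the classes match
extra <- minority[sample(nrow(minority),
                         nrow(majority) - nrow(minority),
                         replace = TRUE), ]

train_balanced <- rbind(majority, minority, extra)
table(train_balanced$attrition)
```

Whether duplicating observations this way is acceptable for inference (as opposed to prediction) is exactly the open question raised above; duplicated rows will shrink standard errors artificially in a logistic regression.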

Final thoughts

---
title: "Kaggle Data - IBM HR Analytics"
author: "G. Conway"
output: 
  flexdashboard::flex_dashboard:
    orientation: column
    vertical_layout: fill
    theme: yeti
    source_code: embed
---

```{r initial_packages, include = FALSE}
library(tidyverse)
library(DataExplorer)
library(knitr)
library(kableExtra)
library(pander)
```

```{r read_load_data, include = FALSE}
data <- read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
```

Welcome 
=======================================================================

Column {data-width=600}
-----------------------------------------------------------------------

### Why this project 

* Get back into the books and notes to refresh on concepts and software

* Refresh on and practice using R/RStudio

* Experiment with flexdashboards to see if I might want/how to incorporate this as part a workflow process

* Work with a relevant data set 

* Learning new things about flexdashboards in R/Rmarkdown (Ex: using html for picture sizing & placement)  
**Image Source:**  https://images.techhive.com/images/article/2016/09/data_science_classes-100682563-large.jpg

Column {data-width=400} ----------------------------------------------------------------------- ### Important Note(s) * This is best viewed on a wide-screen monitor. * After opening this file, expand or maximize the window to properly view it. * A small, or reduced, window size causes the top tabs to move to a second line in the header row. This collapses the page contents in a manner that hides various window/section headers, etc. ### Experimental section This section demonstrates showing a code block without the result(s). ```{r, echo = T, eval = F} a <- 2 + 2 b <- function(x){ print(x^2) } b(a) rm(a, b) # clean memory ``` About The Data {data-navmenu="Data Exploration"} ======================================================================= Column ----------------------------------------------------------------------- ### Data Source IBM HR Analytics Employee Attrition & Performance Downloaded from: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset * Data is fictional - created by IBM data scientists * Insight considerations (general): + Predict attrition + What factors contribute to attrition + Once identify factors contributing to attrition, deep-dive and/or comparisons to develop understanding of those factors ```{r initial_data_clean} # make a copy of the data data_copy <- data # remove the original data to avoid accidentally altering it rm(data) # change all variable names to lower case colnames(data_copy) <- tolower(colnames(data_copy)) # change all character values to lower case data_copy <- data_copy %>% mutate_if(is.character, str_to_lower) # change attrition to binary variable data_copy$attrition <- ifelse(data_copy$attrition == "yes", 1, 0) # factor the remaining categorical variables data_copy$businesstravel <- factor(data_copy$businesstravel, levels = c("non-travel", "travel_rarely", "travel_frequently")) data_copy$department <- factor(data_copy$department) data_copy$educationfield <- factor(data_copy$educationfield) 
data_copy$gender <- factor(data_copy$gender) data_copy$jobrole <- factor(data_copy$jobrole) data_copy$maritalstatus <- factor(data_copy$maritalstatus, levels = c("single", "married", "divorced")) data_copy$over18 <- factor(data_copy$over18) data_copy$overtime <- factor(data_copy$overtime) ``` ### Data Sample (scrollable) ```{r data_sample} head(data_copy, 5) %>% kable(format = "html", escape = F) %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left") ``` Column ----------------------------------------------------------------------- ### Data Overview ```{r data_overview} tempdf <- t(as.data.frame(introduce(data_copy))) colnames(tempdf) <- "value" tempdf %>% kable(format = "html", escape = F) %>% kable_styling(bootstrap_options = "striped", full_width = T, position = "left") rm(tempdf) ``` ### Data Dictionary (scrollable) **_$attrition$_** 0 'No' 1 'Yes' **_$education$_** 1 'Below College' 2 'College' 3 'Bachelor' 4 'Master' 5 'Doctor' **_$environmentsatisfaction$_** 1 'Low' 2 'Medium' 3 'High' 4 'Very High' **_$jobinvolvement$_** 1 'Low' 2 'Medium' 3 'High' 4 'Very High' **_$joblevel$_** Ordinal levels represented by 1, 2, 3, 4, 5. No further meaning known. **_$jobsatisfaction$_** 1 'Low' 2 'Medium' 3 'High' 4 'Very High' **_$performancerating$_** 1 'Low' 2 'Good' 3 'Excellent' 4 'Outstanding' **_$relationshipsatisfaction$_** 1 'Low' 2 'Medium' 3 'High' 4 'Very High' **_$stockoptionlevel$_** Ordinal levels represented by 0, 1, 2, 3. No further meaning known. **_$worklifebalance$_** 1 'Bad' 2 'Good' 3 'Better' 4 'Best'
```{r data_dictionary} tempdf <- as.data.frame(cbind(unlist(sapply(data_copy, levels)))) colnames(tempdf) <- "meaning" tempdf %>% rownames_to_column("factor_level") %>% kable(format = "html", escape = F) %>% kable_styling(bootstrap_options = "striped", full_width = T) rm(tempdf) # to clean memory ``` Missing Values {data-navmenu="Data Exploration"} ======================================================================= **_Missing Values_** ```{r missing values, fig.height = 8.25, fig.width = 8} plot_missing(data_copy) ``` Feature Distributions {data-navmenu="Data Exploration"} ======================================================================= Column{.tabset .tabset-fade} ----------------------------------------------------------------------- ### Continuous data distributions ```{r num_data_distro, fig.width = 8} plot_histogram(data_copy, nrow = 3, ncol = 3) ``` ### Discrete data distributions ```{r discrete_data_distro, fig.width = 8} plot_bar(data_copy, nrow = 3, ncol = 3) ``` Correlations {data-navmenu="Data Exploration"} ======================================================================= Column{.tabset .tabset-fade} ----------------------------------------------------------------------- ### All-Data Correlations ```{r all_correlations, fig.height = 8.25, fig.width = 8} plot_correlation(data_copy, "all") # first dummifies all categories & then # computes correlations for discrete features ``` ### Continuous Data Correlations ```{r num_correlations, fig.height = 8.25, fig.width = 8} plot_correlation(data_copy, "continuous") ``` ### Discrete Data Correlations ```{r discrete_correlations, fig.height = 8.25, fig.width = 8} plot_correlation(data_copy, "discrete") # first dummifies all categories & then # computes correlations ``` Initial Observations/Notes ======================================================================= Column ----------------------------------------------------------------------- **_Notes_** * Using a 70/30 train/test split for 
assessing model performance
* Models to explore: logistic regression (manual and step-wise) and sparse logistic regression
* Although $joblevel$ and $stockoptionlevel$ appear as numbers, they represent distinct, ordered levels. As such, we will leave them as quantitative values but interpret them from an ordinal perspective in this analysis.
* Refer to the **_Data Exploration_** $\rightarrow$ **_About The Data_** $\rightarrow$ **_Data Dictionary_** section to see which variables were factored and their corresponding factor levels.
* $maritalstatus$ was factored as an ordinal variable
**_Observations_**

* The following variables/predictors are not needed for the analysis:
    + $employeecount$ --> each observation represents a single employee
    + $over18$ --> all employees are over 18
    + $standardhours$ --> has only 1 unique value (80)
    + $employeenumber$ --> not needed; simply a row identifier
* There are no missing values; therefore, no imputation or removal of instances is required
* Multicollinearity observed; high correlations involving the following continuous variables could affect the model:
    + $age$ $\rightarrow$ $joblevel$, $monthlyincome$, $totalworkingyears$ and $yearsatcompany$
    + $joblevel$ $\rightarrow$ $monthlyincome$, $totalworkingyears$ and $yearsatcompany$
    + $monthlyincome$ $\rightarrow$ $totalworkingyears$ and $yearsatcompany$
    + $percentsalaryhike$ $\rightarrow$ $performancerating$
    + $totalworkingyears$ $\rightarrow$ $yearsatcompany$
    + $yearsatcompany$ $\rightarrow$ $yearsincurrentrole$, $yearssincelastpromotion$ and $yearswithcurrmanager$
* High correlations among categorical data levels will be ignored initially. I'm choosing to do this because:
    + I am not expanding the data set to include dummy variables for each category level; doing so would increase the dimensionality.
    + The variable selection process may resolve the issue.
* The data is unbalanced on the response variable $attrition$

```{r attrition_balance}
kable(table(data_copy$attrition), col.names = c("", "Freq")) %>%
  kable_styling(full_width = F, position = "left")
```

* In his book _An Introduction to Categorical Data Analysis (2nd Ed.)_, Agresti discusses a guideline that there should "...ideally be at least 10 outcomes of each type for every predictor." This guideline indicates that there should be **no more** than 23-24 predictors in our final logistic regression model.
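The arithmetic behind that cap can be sketched directly from the outcome counts. A minimal, self-contained illustration (the counts 1233 "No" / 237 "Yes" are the known totals for this Kaggle data set and should match the frequency table above):

```r
# Agresti's events-per-variable rule of thumb, applied to this data set's
# known outcome counts (1233 "No", 237 "Yes"); the rarer outcome class is
# the binding constraint.
counts <- c(no = 1233, yes = 237)
max_predictors <- floor(min(counts) / 10)  # at least 10 outcomes of each type per predictor
max_predictors  # 23
```

With 237 "Yes" outcomes, 237 / 10 = 23.7, which is where the 23-24 figure above comes from.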
```{r drop_unused_vars} # drop unused variables data_copy <- subset(data_copy, select = -c(employeecount, over18, standardhours, employeenumber)) ``` ```{r create_train_test_sets} # set seed to recreate sampling in future code runs set.seed(259) # number of instances/observations that should be in each set num_train <- 0.7*nrow(data_copy) num_test <- 0.3*nrow(data_copy) # create indicators all_index <- sample(1:nrow(data_copy), size = nrow(data_copy), replace = F) train_index <- all_index[1:num_train] test_index <- which(!(c(1:nrow(data_copy)) %in% train_index)) # create data sets train <- data_copy[train_index, ] test <- data_copy[test_index, ] # clean up memory rm(num_train, num_test, all_index, train_index, test_index) ``` Saturated (Full) Model {data-navmenu="Logistic Regression"} ======================================================================= **_Saturated (Full) Model_** \begin{align} logit[P(attrition = 1 (Yes))] = &\beta_0 + \beta_1age + \beta_2businesstravel + \\ &\beta_3dailyrate + \beta_4department + \beta_5distancefromhome + \beta_6education + \\ &\beta_7educationfield + \beta_8environmentsatisfaction + \beta_9gender + \beta_{10}hourlyrate + \\ &\beta_{11}jobinvolvement + \beta_{12}joblevel + \beta_{13}jobrole + \beta_{14}jobsatisfaction + \\ &\beta_{15}maritalstatus + \beta_{16}monthlyincome + \beta_{17}monthlyrate + \beta_{18}numcompaniesworked + \\ &\beta_{19}overtime + \beta_{20}percentsalaryhike + \beta_{21}performancerating + \\ &\beta_{22}relationshipsatisfaction + \beta_{23}stockoptionlevel + \\ &\beta_{24}totalworkingyears + \beta_{25}trainingtimeslastyear + \\ &\beta_{26}worklifebalance + \beta_{27}yearsatcompany + \beta_{28}yearsincurrentrole + \\ &\beta_{29}yearssincelastpromotion + \beta_{30}yearswithcurrmanager \end{align} * **Note: at this point we begin working with the training data.** * Before further refining the logistic regression model, let's look at the summary for the saturated model. 
Red indicates statistical significance ($p-value < 0.05$).

```{r full_model}
# get the full model and look at it
mod_full <- glm(attrition ~ age + businesstravel + dailyrate + department +
                  distancefromhome + education + educationfield +
                  environmentsatisfaction + gender + hourlyrate +
                  jobinvolvement + joblevel + jobrole + jobsatisfaction +
                  maritalstatus + monthlyincome + monthlyrate +
                  numcompaniesworked + overtime + percentsalaryhike +
                  performancerating + relationshipsatisfaction +
                  stockoptionlevel + totalworkingyears + trainingtimeslastyear +
                  worklifebalance + yearsatcompany + yearsincurrentrole +
                  yearssincelastpromotion + yearswithcurrmanager,
                family = binomial(link = "logit"), data = train)

as.data.frame(round(coef(summary(mod_full)), 5)) %>%
  rownames_to_column("var") %>%
  mutate(
    `Pr(>|z|)` = cell_spec(`Pr(>|z|)`, "html",
                           color = ifelse(`Pr(>|z|)` < 0.05, "red", "black"))
  ) %>%
  column_to_rownames("var") %>%
  kable(format = "html", escape = F) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
```

* Many of the coefficients are not significant, and we have not yet addressed the collinearity in the data.
* To address the collinearity, we'll look at the variance inflation factors of the model. But first, let's check for independence between the categorical response and each categorical predictor. Where the response turns out to be independent of a predictor, we will remove that predictor from the model.

Check X vs. Y Independence {data-navmenu="Logistic Regression"}
=======================================================================

Column
-----------------------------------------------------------------------

### Check X vs. Y Independence

* Here, we'll check for a relationship between each categorical variable and the response variable ($H_o$: no relationship/independence) using contingency tables, $\chi^2$ statistics, and $p-values$.
Where contingency tables have an ordinal variable w/ attrition (nominal - 2 levels) we will use the Cochran-Mantel-Haenszel (CMH) test, a linear trend test, since it will have more power. **The results of the CMH test will be directly beneath the corresponding CrossTable and indicated by d.f. = 1.**
* **Use $\chi^2$ test for independence for the following predictors w/ attrition:** + $businesstravel$ + $department$ + $educationfield$ + $gender$ + $jobrole$ + $maritalstatus$ + $overtime$
* **CMH test for the remaining categorical predictors vs. attrition**
* The following predictors are independent of the response (i.e. $p-value > 0.05$). We will remove these predictors from the model since the response does not depend on these predictors. + $gender$ + $relationshipsatisfaction$ + $worklifebalance$ Column ----------------------------------------------------------------------- ### $\chi^2$ tests (scrollable) ```{r} # NOTE: the descr package can be used to generate CrossTables and get Chi-sq # results which works well when using pander() to knit tables. # attach the train data to be able to use the variables names in R functions attach(train) library(gmodels) # to use CrossTable() CrossTable(businesstravel, attrition, expected = T, prop.r = F, prop.c = F, prop.t = F, prop.chisq = F, chisq = T) ```


```{r} CrossTable(department, attrition, expected = T, prop.r = F, prop.c = F, prop.t = F, prop.chisq = F, chisq = T) ```


```{r} CrossTable(educationfield, attrition, expected = T, prop.r = F, prop.c = F, prop.t = F, prop.chisq = F, chisq = T) ```


```{r} CrossTable(gender, attrition, expected = T, prop.r = F, prop.c = F, prop.t = F, prop.chisq = F, chisq = T) ```


```{r} CrossTable(jobrole, attrition, expected = T, prop.r = F, prop.c = F, prop.t = F, prop.chisq = F, chisq = T) ```


```{r} CrossTable(maritalstatus, attrition, expected = T, prop.r = F, prop.c = F, prop.t = F, prop.chisq = F, chisq = T) ```


```{r} CrossTable(overtime, attrition, expected = T, prop.r = F, prop.c = F, prop.t = F, prop.chisq = F, chisq = T) ``` ### CMH tests (scrollable) ```{r} # CrossTable for education vs. attrition is only to check expected cell counts CrossTable(education, attrition, digits = 0, expected = T, prop.r = F, prop.c = F, prop.t = F, prop.chisq = F) # use CMH test for education vs. attrition tab <- table(education, attrition) vcdExtra::CMHtest(tab)$table[1, ] ```


```{r} # CrossTable for environmentsatisfaction vs. attrition is only to check expected cell counts CrossTable(environmentsatisfaction, attrition, digits = 0, expected = T, prop.r = F, prop.c = F, prop.t = F, prop.chisq = F) # use CMH test for environmentsatisfaction vs. attrition tab <- table(environmentsatisfaction, attrition) vcdExtra::CMHtest(tab)$table[1, ] ```


```{r} # CrossTable for jobinvolvement vs. attrition is only to check expected cell counts CrossTable(jobinvolvement, attrition, digits = 0, expected = T, prop.r = F, prop.c = F, prop.t = F, prop.chisq = F) # use CMH test for jobinvolvement vs. attrition tab <- table(jobinvolvement, attrition) vcdExtra::CMHtest(tab)$table[1, ] ```


```{r} # CrossTable for joblevel vs. attrition is only to check expected cell counts CrossTable(joblevel, attrition, digits = 0, expected = T, prop.r = F, prop.c = F, prop.t = F, prop.chisq = F) # use CMH test for joblevel vs. attrition tab <- table(joblevel, attrition) vcdExtra::CMHtest(tab)$table[1, ] ```


```{r} # CrossTable for jobsatisfaction vs. attrition is only to check expected cell counts CrossTable(jobsatisfaction, attrition, digits = 0, expected = T, prop.r = F, prop.c = F, prop.t = F, prop.chisq = F) # use CMH test for jobsatisfaction vs. attrition tab <- table(jobsatisfaction, attrition) vcdExtra::CMHtest(tab)$table[1, ] ```


```{r} # CrossTable for relationshipsatisfaction vs. attrition is only to check expected cell counts CrossTable(relationshipsatisfaction, attrition, digits = 0, expected = T, prop.r = F, prop.c = F, prop.t = F, prop.chisq = F) # use CMH test for relationshipsatisfaction vs. attrition tab <- table(relationshipsatisfaction, attrition) vcdExtra::CMHtest(tab)$table[1, ] ```


```{r} # CrossTable for stockoptionlevel vs. attrition is only to check expected cell counts CrossTable(stockoptionlevel, attrition, digits = 0, expected = T, prop.r = F, prop.c = F, prop.t = F, prop.chisq = F) # use CMH test for stockoptionlevel vs. attrition tab <- table(stockoptionlevel, attrition) vcdExtra::CMHtest(tab)$table[1, ] ```


```{r} # CrossTable for worklifebalance vs. attrition is only to check expected cell counts CrossTable(worklifebalance, attrition, digits = 0, expected = T, prop.r = F, prop.c = F, prop.t = F, prop.chisq = F) # use CMH test for worklifebalance vs. attrition tab <- table(worklifebalance, attrition) vcdExtra::CMHtest(tab)$table[1, ] ``` ```{r} # clean memory rm(tab) ``` Check VIFs {data-navmenu="Logistic Regression"} ======================================================================= **_Model m2 showing the removal of the three predictors from the previous section_** \begin{align} logit[P(attrition = 1 (Yes))] = &\beta_0 + \beta_1age + \beta_2businesstravel + \\ &\beta_3dailyrate + \beta_4department + \beta_5distancefromhome + \beta_6education + \\ &\beta_7educationfield + \beta_8environmentsatisfaction + \beta_{10}hourlyrate + \\ &\beta_{11}jobinvolvement + \beta_{12}joblevel + \beta_{13}jobrole + \beta_{14}jobsatisfaction + \\ &\beta_{15}maritalstatus + \beta_{16}monthlyincome + \beta_{17}monthlyrate + \beta_{18}numcompaniesworked + \\ &\beta_{19}overtime + \beta_{20}percentsalaryhike + \beta_{21}performancerating + \\ &\beta_{23}stockoptionlevel + \beta_{24}totalworkingyears + \beta_{25}trainingtimeslastyear + \\ &\beta_{27}yearsatcompany + \beta_{28}yearsincurrentrole + \\ &\beta_{29}yearssincelastpromotion + \beta_{30}yearswithcurrmanager \end{align} * Check the variance inflation factors of the reduced model (m2) to identify additional potential variables to remove ```{r model_m2} # clean memory rm(mod_full) m2 <- glm(attrition ~ age + businesstravel + dailyrate + department + distancefromhome + education + educationfield + environmentsatisfaction + hourlyrate + jobinvolvement + joblevel + jobrole + jobsatisfaction + maritalstatus + monthlyincome + monthlyrate + numcompaniesworked + overtime + percentsalaryhike + performancerating + stockoptionlevel + totalworkingyears + trainingtimeslastyear + yearsatcompany + yearsincurrentrole + yearssincelastpromotion + 
yearswithcurrmanager,
          family = binomial(link = "logit"), data = train)
```

```{r vif_full}
# look at the variance inflation factors to check multicollinearity
library(car)
vif_result_m2 <- as.data.frame(vif(m2))
kable(vif_result_m2) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
```

* The VIF table alone doesn't give clear direction because predictors with more than one degree of freedom report generalized VIFs (GVIF). Instead, square the `GVIF^(1/(2*Df))` column of vif_result_m2 and apply the standard VIF rules of thumb. NOTE: see https://stats.stackexchange.com/questions/70679/which-variance-inflation-factor-should-i-be-using-textgvif-or-textgvif for discussion on the use of GVIF.

```{r gvif_use, fig.width=8}
vif_result_m2$GVIF_SQ_measure <- vif_result_m2$`GVIF^(1/(2*Df))`^2

# view VIF-related results in a plot
vif_result_m2$variable <- rownames(vif_result_m2)
ggplot(data = vif_result_m2, aes(x = variable, y = GVIF_SQ_measure)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, vjust = -0.005, hjust = 0))

# this zooms in on the plot
ggplot(data = vif_result_m2, aes(x = variable, y = GVIF_SQ_measure)) +
  geom_bar(stat = "identity") +
  ylim(0, 20) +
  theme(axis.text.x = element_text(angle = 90, vjust = -0.005, hjust = 0))

rm(vif_result_m2)
```

* Remove the following variables with a GVIF_SQ_measure > 5. The predictors identified as having high VIF measures correspond with the correlations noted in the heatmaps generated in the **_Data Exploration_** $\rightarrow$ **_Correlations_** section.
+ $department$ + $joblevel$ + $jobrole$ + $monthlyincome$ + $yearsatcompany$ Reduced Model (m2) {data-navmenu="Logistic Regression"} ======================================================================= Column ----------------------------------------------------------------------- ### Reduced model m2 * Now that we've identified an initial set of variables to remove, we arrive at a reduced model (m2) in the form of: \begin{align} logit[P(attrition = 1 (Yes))] = &\beta_0 + \beta_1age + \beta_2businesstravel + \\ &\beta_3dailyrate + \beta_5distancefromhome + \beta_6education + \\ &\beta_7educationfield + \beta_8environmentsatisfaction + \beta_{10}hourlyrate + \\ &\beta_{11}jobinvolvement + \beta_{14}jobsatisfaction + \\ &\beta_{15}maritalstatus + \beta_{17}monthlyrate + \beta_{18}numcompaniesworked + \\ &\beta_{19}overtime + \beta_{20}percentsalaryhike + \beta_{21}performancerating + \\ &\beta_{23}stockoptionlevel + \beta_{24}totalworkingyears + \beta_{25}trainingtimeslastyear + \\ &\beta_{28}yearsincurrentrole + \\ &\beta_{29}yearssincelastpromotion + \beta_{30}yearswithcurrmanager \end{align} ```{r reduced_m2} m2 <- glm(attrition ~ age + businesstravel + dailyrate + distancefromhome + education + educationfield + environmentsatisfaction + hourlyrate + jobinvolvement + jobsatisfaction + maritalstatus + monthlyrate + numcompaniesworked + overtime + percentsalaryhike + performancerating + stockoptionlevel + totalworkingyears + trainingtimeslastyear + yearsincurrentrole + yearssincelastpromotion + yearswithcurrmanager, family = binomial(link = "logit"), data = train) ``` * Does the model (m2) fit? 
```{r M2_fit_summary} as.data.frame(round(coef(summary(m2)), 5)) %>% rownames_to_column("var") %>% mutate( `Pr(>|z|)` = cell_spec(`Pr(>|z|)`, "html", color = ifelse(`Pr(>|z|)` < 0.05, "red", "black")) ) %>% column_to_rownames("var") %>% kable(format = "html", escape = F) %>% kable_styling(bootstrap_options = "striped", full_width = F) data.frame("Residual Deviance" = m2$deviance, "Residual df" = m2$df.residual) %>% kable(format = "html", escape = F) %>% kable_styling(full_width = F) ``` * We see that model M2 fits by $\frac{deviance_{res}}{df_{res}} \leq 1$. Check the marginal model plots to check model validity and to see if any of the continuous predictors are misspecified. If a predictor is misspecified, check the conditional density plots to see what kind of transformation may be needed. Column ----------------------------------------------------------------------- ### Marginal Model Plots (scrollable) * Check the marginal model plots (mmp) of the quantitative predictors to see if the model is specified correctly ```{r mmp_m2} # load the car library to use the mmp() function library(car) attach(train) # check the marginal model plots of the quantitative variables for model validity # and indicators that particular variables need to be transformed mmp(m2, m2$linear.predictors) mmp(m2, age) mmp(m2, dailyrate) mmp(m2, distancefromhome) mmp(m2, hourlyrate) mmp(m2, monthlyrate) mmp(m2, numcompaniesworked) mmp(m2, percentsalaryhike) mmp(m2, trainingtimeslastyear) mmp(m2, yearsincurrentrole) mmp(m2, yearssincelastpromotion) mmp(m2, yearswithcurrmanager) ``` ### Conditional Density Plots (scrollable) * $age$ appears to be misspecified. 
```{r conditional_dens_plots_m2}
boxplot(age ~ attrition, ylab = "age", xlab = "attrition")

# note: attrition is coded 0/1, so subset with the numeric codes
plot(density(age[attrition == 0], bw = "SJ", kern = "gaussian"),
     type = "l", xlab = "age", main = "")
rug(age[attrition == 0])
lines(density(age[attrition == 1], bw = "SJ", kern = "gaussian"),
      type = "l")
rug(age[attrition == 1])
```

* $age$ appears approximately normal with the same/similar variance for both values of $attrition$ (i.e., yes and no). Let's try adding a quadratic term to the model for $age$.

Reduced Model (m3) {data-navmenu="Logistic Regression"}
=======================================================================

Column
-----------------------------------------------------------------------

### Model m3

* Here we will include a quadratic term in the model for $age$ and arrive at model (m3) in the form of:

\begin{align}
logit[P(attrition = 1 (Yes))] = &\beta_0 + \beta_1age + \beta_{1a}age^2 + \beta_2businesstravel + \\
&\beta_3dailyrate + \beta_5distancefromhome + \beta_6education + \\
&\beta_7educationfield + \beta_8environmentsatisfaction + \beta_{10}hourlyrate + \\
&\beta_{11}jobinvolvement + \beta_{14}jobsatisfaction + \\
&\beta_{15}maritalstatus + \beta_{17}monthlyrate + \beta_{18}numcompaniesworked + \\
&\beta_{19}overtime + \beta_{20}percentsalaryhike + \beta_{21}performancerating + \\
&\beta_{23}stockoptionlevel + \beta_{24}totalworkingyears + \beta_{25}trainingtimeslastyear + \\
&\beta_{28}yearsincurrentrole + \\
&\beta_{29}yearssincelastpromotion + \beta_{30}yearswithcurrmanager
\end{align}

```{r model_m3}
# clean memory
rm(m2)

m3 <- glm(attrition ~ age + I(age^2) + businesstravel + dailyrate +
            distancefromhome + education + educationfield +
            environmentsatisfaction + hourlyrate + jobinvolvement +
            jobsatisfaction + maritalstatus + monthlyrate +
            numcompaniesworked + overtime + percentsalaryhike +
            performancerating + stockoptionlevel + totalworkingyears +
            trainingtimeslastyear + yearsincurrentrole +
yearssincelastpromotion + yearswithcurrmanager, family = binomial(link = "logit"), data = train) ``` * Does the model (m3) fit? ```{r m3_fit_summary} as.data.frame(round(coef(summary(m3)), 5)) %>% rownames_to_column("var") %>% mutate( `Pr(>|z|)` = cell_spec(`Pr(>|z|)`, "html", color = ifelse(`Pr(>|z|)` < 0.05, "red", "black")) ) %>% column_to_rownames("var") %>% kable(format = "html", escape = F) %>% kable_styling(bootstrap_options = "striped", full_width = F) data.frame("Residual Deviance" = m3$deviance, "Residual df" = m3$df.residual) %>% kable(format = "html", escape = F) %>% kable_styling(full_width = F) ``` * We see that model m3 fits by $\frac{deviance_{res}}{df_{res}} \leq 1$. Check the marginal model plots. Column ----------------------------------------------------------------------- ### Marginal Model Plots (scrollable) * Check the marginal model plots (mmp) of the quantitative predictors to see if the model is specified correctly ```{r mmp_m3} mmp(m3, m3$linear.predictors) mmp(m3, age) mmp(m3, dailyrate) mmp(m3, distancefromhome) mmp(m3, hourlyrate) mmp(m3, monthlyrate) mmp(m3, numcompaniesworked) mmp(m3, percentsalaryhike) mmp(m3, trainingtimeslastyear) mmp(m3, yearsincurrentrole) mmp(m3, yearssincelastpromotion) mmp(m3, yearswithcurrmanager) ``` ### Observations on model m3 * Adding the $age^2$ term corrects some misspecification in the model. 
* Look at the standardized deviance residuals and inspect for outliers and bad leverage points

Leverage (Model m3) {data-navmenu="Logistic Regression"}
=======================================================================

Column
-----------------------------------------------------------------------

### Leverage Plot

```{r leverage_m3}
hvalues <- influence(m3)$hat
stanresDeviance <- residuals(m3) / sqrt(1 - hvalues)
levcutoff <- 2 * mean(hvalues)
highlevs <- which(hvalues > levcutoff)
```

```{r leverage_plot}
qplot(hvalues, stanresDeviance) +
  xlab("Leverage Values") +
  ylab("Standardized Deviance Residuals") +
  geom_vline(xintercept = levcutoff, color = "steelblue") +
  geom_point(aes(hvalues[highlevs], stanresDeviance[highlevs],
                 color = "high leverage")) +
  geom_point(aes(hvalues[stanresDeviance > 2], stanresDeviance[stanresDeviance > 2],
                 color = "outlier (SDR > 2)")) +
  labs(color = "")
```

### Observations

* There don't appear to be any bad leverage points, although there are several points of high leverage.
* There do appear to be outliers in the data.
* Since this is a simulated dataset, we will assume that there is sufficient reason for removing the outliers.
Column ----------------------------------------------------------------------- ### Number of outliers ```{r num_outliers, comment=NA} outliers <- c(which(stanresDeviance > 2), which(stanresDeviance < -2)) names(outliers) <- NULL pander(length(outliers)) ``` ### Outlier indices ```{r outlier_indices, comment=NA} pander(outliers) ``` ### Outlier data (scrollable) ```{r outlier_data} train[outliers, ] %>% kable(format = "html", escape = F) %>% kable_styling(full_width = F, bootstrap_options = "striped") ``` ```{r} # clean memory rm(hvalues, levcutoff, stanresDeviance) ``` Outliers Removed (Model m3) {data-navmenu="Logistic Regression"} ======================================================================= ```{r remove outliers} train_minus_outliers <- train[-outliers, ] ``` Column ----------------------------------------------------------------------- ### Model m3 w/o outliers ```{r m3_no_outliers} m3 <- glm(attrition ~ age + I(age^2) + businesstravel + dailyrate + distancefromhome + education + educationfield + environmentsatisfaction + hourlyrate + jobinvolvement + jobsatisfaction + maritalstatus + monthlyrate + numcompaniesworked + overtime + percentsalaryhike + performancerating + stockoptionlevel + totalworkingyears + trainingtimeslastyear + yearsincurrentrole + yearssincelastpromotion + yearswithcurrmanager, family = binomial(link = "logit"), data = train_minus_outliers) ``` * Does the model fit? 
```{r m3_summary_no_outliers} as.data.frame(round(coef(summary(m3)), 5)) %>% rownames_to_column("var") %>% mutate( `Pr(>|z|)` = cell_spec(`Pr(>|z|)`, "html", color = ifelse(`Pr(>|z|)` < 0.05, "red", "black")) ) %>% column_to_rownames("var") %>% kable(format = "html", escape = F) %>% kable_styling(bootstrap_options = "striped", full_width = F) data.frame("Residual Deviance" = m3$deviance, "Residual df" = m3$df.residual) %>% kable(format = "html", escape = F) %>% kable_styling(full_width = F) ``` * We see that model m3 without outliers in the data still fits by $\frac{deviance_{res}}{df_{res}} \leq 1$. Check the marginal model plots. Column ----------------------------------------------------------------------- ### Marginal Model Plots (scrollable) * Check the marginal model plots (mmp) of the quantitative predictors to see if the model is specified correctly ```{r mmp_m3_no_outliers} # detach the original train data set detach(train) # attach the train set that does not have the outliers attach(train_minus_outliers) mmp(m3, m3$linear.predictors) mmp(m3, age) mmp(m3, dailyrate) mmp(m3, distancefromhome) mmp(m3, hourlyrate) mmp(m3, monthlyrate) mmp(m3, numcompaniesworked) mmp(m3, percentsalaryhike) mmp(m3, trainingtimeslastyear) mmp(m3, yearsincurrentrole) mmp(m3, yearssincelastpromotion) mmp(m3, yearswithcurrmanager) ``` ### Observations on model m3 * After specifying the predictors correctly earlier, we saw that the linear fit of the model could, potentially, still improve. Therefore, we inspected the leverage and looked for outliers. * We identified 34 outliers in the training set, and, for the purpose of this analysis, removed those outliers under the assumption that there was sufficient reason to do so. Recall, this is a simulated data set. * After removing the outliers from the training set and re-running model m3 with the new data, we find that the overall linear fit of the model improved greatly. 
* There still appears to be several predictors that are not significant, though. Let's run step-wise variable selection to see if we can further reduce the number of predictors in the model. Variable Selection (Model m3) {data-navmenu="Logistic Regression"} ======================================================================= Column ----------------------------------------------------------------------- ### Fwd/Bwd Stepwise Selection ```{r fwd_bwd_stepwise, results="hide"} # NOTE: here, we hide the results of the code chunk because it won't display # fully in the final knitted dashboard library(MASS) step.m3 <- stepAIC(m3, direction = "both") ``` ```{r stepwise_formula_results} # View results of final fwd/bwd stepwise selection step.m3$anova ``` Column ----------------------------------------------------------------------- ### Observations * Step-wise variable selection on model m3 removes seven variables. * All seven variables removed from the model were not previously significant. * Interestingly, $dailyrate$ was not removed from the model even though it was not statistically significant before. 
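As an aside, the fit check used throughout this analysis (residual deviance vs. residual degrees of freedom) can be wrapped in a small helper. This is an illustrative sketch on simulated data, not part of the original model pipeline; `fit_check` is a hypothetical name:

```r
# Hypothetical helper for the rule of thumb used above: a model is taken to
# fit adequately when residual deviance / residual df is at or below ~1.
fit_check <- function(mod) {
  c(deviance = mod$deviance,
    df = mod$df.residual,
    ratio = mod$deviance / mod$df.residual)
}

# toy logistic fit so the sketch is self-contained
set.seed(42)
x <- rnorm(300)
y <- rbinom(300, 1, plogis(-0.5 + x))
fit_check(glm(y ~ x, family = binomial(link = "logit")))
```

Applying `fit_check()` to any of the models above reproduces the deviance/df comparison shown in their summary tables.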
Final Logistic Regr Model {data-navmenu="Logistic Regression"} ======================================================================= ### Final Logistic Regression Model \begin{align} logit[P(attrition = 1 (Yes))] = &\beta_0 + \beta_1age + \beta_{1a}age^2 + \beta_2businesstravel + \\ &\beta_3dailyrate + \beta_5distancefromhome + \\ &\beta_7educationfield + \beta_8environmentsatisfaction + \\ &\beta_{11}jobinvolvement + \beta_{14}jobsatisfaction + \\ &\beta_{15}maritalstatus + \beta_{18}numcompaniesworked + \\ &\beta_{19}overtime + \beta_{24}totalworkingyears + \beta_{25}trainingtimeslastyear + \\ &\beta_{28}yearsincurrentrole + \\ &\beta_{29}yearssincelastpromotion \end{align} ```{r final_logistic_model} # clean memory rm(m3, step.m3) final_logistic_mod <- glm(attrition ~ age + I(age^2) + businesstravel + dailyrate + distancefromhome + educationfield + environmentsatisfaction + jobinvolvement + jobsatisfaction + maritalstatus + numcompaniesworked + overtime + totalworkingyears + trainingtimeslastyear + yearsincurrentrole + yearssincelastpromotion, family = binomial(link = "logit"), data = train_minus_outliers) ``` ```{r final_logistic_mod_summary} as.data.frame(round(coef(summary(final_logistic_mod)), 5)) %>% rownames_to_column("var") %>% mutate( `Pr(>|z|)` = cell_spec(`Pr(>|z|)`, "html", color = ifelse(`Pr(>|z|)` < 0.05, "red", "black")) ) %>% column_to_rownames("var") %>% kable(format = "html", escape = F) %>% kable_styling(bootstrap_options = "striped", full_width = F) data.frame("Residual Deviance" = final_logistic_mod$deviance, "Residual df" = final_logistic_mod$df.residual) %>% kable(format = "html", escape = F) %>% kable_styling(full_width = F) ``` SLR Model {data-navmenu="Sparse Logistic Regression"} ======================================================================= Column ----------------------------------------------------------------------- ```{r create_matrix} # - keep only the relevant variables from the training set # - start with the same 
#   variables as model m3 (NOT the final logistic regr model)
# - NOTE 1: we added the age^2 term to the data for consistency with the final
#   logistic regression model
# - NOTE 2: cv.glmnet() requires the data to be in a matrix. data.matrix() seems
#   to work best for converting the data to a matrix from a data frame
xtrain_copy <- train_minus_outliers[, -c(2, 5, 10, 13, 14, 17, 23, 27, 28)]
xtrain_copy$age_sq <- xtrain_copy$age^2
xtrain_copy <- xtrain_copy[, c(1, 23, 2:22)]
xtrain_copy <- data.matrix(xtrain_copy)

ytrain_copy <- train_minus_outliers$attrition
```

### Notes

* Before applying the Lasso method, we consider the following:
    + $gender$, $relationshipsatisfaction$ and $worklifebalance$ are removed from the model because of their independence from the response variable ($attrition$). See **_Logistic Regression_** $\rightarrow$ **_Check X vs. Y Independence_**.
    + The 31 outliers identified in the **_Logistic Regression_** $\rightarrow$ **_Leverage_** section were removed from the training data set beforehand to reduce the effects of outliers on the model.
    + $department$, $joblevel$, $jobrole$, $monthlyincome$, and $yearsatcompany$ were removed because of high VIF/GVIF values. See **_Logistic Regression_** $\rightarrow$ **_Check VIFs_**.
    + The quadratic term $age^2$ is added because we saw earlier that $age$ is misspecified in the model. See the **_Logistic Regression_** $\rightarrow$ **_Reduced Model (m2)_** and **_Reduced Model (m3)_** sections.
    + For the remaining predictor terms, we will apply the Lasso method for variable selection.
    + The starting model for SLR (lasso) is the same as model m3. See **_Logistic Regression_** $\rightarrow$ **_Reduced Model (m3)_**.
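One behavior of `data.matrix()` is worth keeping in mind here (a toy illustration, not part of the analysis; `df` is a hypothetical data frame): it converts each factor column to its integer level codes, so a k-level factor enters the lasso as a single numeric column rather than the k-1 dummy columns that `model.matrix()` would produce.

```r
# data.matrix() replaces factor columns with their integer level codes
df <- data.frame(x = c(1.5, 2.5, 3.5), f = factor(c("a", "b", "a")))
data.matrix(df)            # column f holds the codes 1, 2, 1
model.matrix(~ x + f, df)  # column fb is a 0/1 dummy for level "b"
```

This is consistent with the earlier choice not to expand category levels into dummy variables, at the cost of treating unordered levels as if they were numeric.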
Model before applying SLR (lasso) \begin{align} logit[P(attrition = 1 (Yes))] = &\beta_0 + \beta_1age + \beta_{1a}age^2 + \beta_2businesstravel + \\ &\beta_3dailyrate + \beta_5distancefromhome + \beta_6education + \\ &\beta_7educationfield + \beta_8environmentsatisfaction + \beta_{10}hourlyrate + \\ &\beta_{11}jobinvolvement + \beta_{14}jobsatisfaction + \\ &\beta_{15}maritalstatus + \beta_{17}monthlyrate + \beta_{18}numcompaniesworked + \\ &\beta_{19}overtime + \beta_{20}percentsalaryhike + \beta_{21}performancerating + \\ &\beta_{23}stockoptionlevel + \beta_{24}totalworkingyears + \beta_{25}trainingtimeslastyear + \\ &\beta_{28}yearsincurrentrole + \\ &\beta_{29}yearssincelastpromotion + \beta_{30}yearswithcurrmanager \end{align} ### Sparse Logistic Regression (SLR) Model Fit ```{r sparse_logistic_regression_fit} library(glmnet) glmnetFit <- cv.glmnet(x = xtrain_copy, y = ytrain_copy, alpha = 1, family = "binomial") plot(glmnetFit) ``` Column ----------------------------------------------------------------------- ### SLR Coefficients ```{r SLR_coef} tempdf <- as.data.frame(as.matrix( cbind(coef(glmnetFit, s = "lambda.min"), coef(glmnetFit, s = "lambda.1se")))) colnames(tempdf) <- c("coef_for_lambda_min", "coef_for_lambda_1se") tempdf %>% rownames_to_column("var") %>% mutate( `coef_for_lambda_min` = cell_spec(`coef_for_lambda_min`, "html", color = ifelse(abs(`coef_for_lambda_min`) < 0.01, "red", "black")) ) %>% mutate( `coef_for_lambda_1se` = cell_spec(`coef_for_lambda_1se`, "html", color = ifelse(abs(`coef_for_lambda_1se`) < 0.01, "red", "black")) ) %>% column_to_rownames("var") %>% kable(format = "html", escape = F) %>% kable_styling(bootstrap_options = "striped", full_width = F) # clean memory rm(tempdf, xtrain_copy, ytrain_copy) ``` Logistic Regression Comparisons {data-navmenu="Sparse Logistic Regression"} ======================================================================= Column ----------------------------------------------------------------------- ### 
Final Logistic Regr Coefficients by Step-wise Var Select

```{r final_logistic_regr_coefs}
as.data.frame(round(coef(summary(final_logistic_mod)), 5)) %>%
  rownames_to_column("var") %>%
  mutate(
    `Pr(>|z|)` = cell_spec(`Pr(>|z|)`, "html",
                           color = ifelse(`Pr(>|z|)` < 0.05, "red", "black"))
  ) %>%
  column_to_rownames("var") %>%
  kable(format = "html", escape = F) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
```

Column
-----------------------------------------------------------------------

### Final Logistic Regr Coef by Lasso Var Select

```{r final_SLR_coefs}
tempdf <- as.data.frame(as.matrix(
  cbind(coef(glmnetFit, s = "lambda.min"), coef(glmnetFit, s = "lambda.1se"))))
colnames(tempdf) <- c("coef_for_lambda_min", "coef_for_lambda_1se")
tempdf %>%
  rownames_to_column("var") %>%
  mutate(
    `coef_for_lambda_min` = cell_spec(`coef_for_lambda_min`, "html",
      color = ifelse(abs(`coef_for_lambda_min`) < 0.01, "red", "black"))
  ) %>%
  mutate(
    `coef_for_lambda_1se` = cell_spec(`coef_for_lambda_1se`, "html",
      color = ifelse(abs(`coef_for_lambda_1se`) < 0.01, "red", "black"))
  ) %>%
  column_to_rownames("var") %>%
  kable(format = "html", escape = F) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

# clean memory
rm(tempdf)
```

RF Model {data-navmenu="Random Forest"}
=======================================================================

Column
-----------------------------------------------------------------------

### Random Forest Model

* We'll use the whole training set for the Random Forest model. It should be less affected by collinearity.
* We'll use the training set to identify important variables to keep and then re-run a more sparse model before measuring performance.
* We'll also build two initial models based on: 1) training set w/ outliers and 2) training set w/o outliers * For the plots to the right: + Green - class error for class 1 (i.e., $attrition$ = 1 (yes)) + Red - class error for class 0 (i.e., $attrition$ = 0 (no)) + Black - out of bag error + NOTE: we see lower error for class 0 because there are more "No" responses to learn from in the data ```{r rf_data} # NOTE: you must FACTOR the response variable here so that the RF model can # classify each instance as 0 or 1 xtrain <- train[, -2] ytrain <- factor(train$attrition) xtrain_no_out <- train_minus_outliers[, -2] ytrain_no_out <- factor(train_minus_outliers$attrition) ``` ```{r rf_model} library(randomForest) rfmod <- randomForest(x = xtrain, y = ytrain, ntree = 500, importance = T) rfmod_no_out <- randomForest(x = xtrain_no_out, y = ytrain_no_out, ntree = 500, importance = T) ``` Column ----------------------------------------------------------------------- ### RF Model Plot (w/ outliers) ```{r rfmod_plot} plot(rfmod) ``` ### RF Model Plot (w/o outliers) ```{r rfmod_plot_no_out} plot(rfmod_no_out) ``` Variable Importance (w/ Outliers) {data-navmenu="Random Forest"} ======================================================================= Column ----------------------------------------------------------------------- ### Variable Importance (w/ outliers) - Mean Decr in Accuracy ```{r rf_importance_type_one} var_imp <- importance(rfmod, type = 1) library(plotly) rnames <- rownames(var_imp) values <- var_imp[, 1] names(values) <- NULL plot_ly(as.data.frame(var_imp), x = ~rownames(as.data.frame(var_imp)), y = ~MeanDecreaseAccuracy, type = "bar") %>% layout(xaxis = list(title = "Variable")) ``` ### Variable Importance Plot (w/ outliers) - Mean Decr in Accuracy ```{r} varImpPlot(rfmod, type = 1) ``` Column ----------------------------------------------------------------------- ### Variable Importance (w/ outliers) - Mean Decr in Node Impurity ```{r rf_importance_type_two} 
var_imp <- importance(rfmod, type = 2)
rnames <- rownames(var_imp)
values <- var_imp[, 1]
names(values) <- NULL
plot_ly(as.data.frame(var_imp), x = ~rownames(as.data.frame(var_imp)),
        y = ~MeanDecreaseGini, type = "bar") %>%
  layout(xaxis = list(title = "Variable"))
```

### Variable Importance Plot (w/ outliers) - Mean Decr in Node Impurity

```{r}
varImpPlot(rfmod, type = 2)
```

Variable Importance (w/o Outliers) {data-navmenu="Random Forest"}
=======================================================================

Column
-----------------------------------------------------------------------

### Variable Importance (w/o outliers) - Mean Decr in Accuracy

```{r rf_importance_no_out_type_one}
var_imp <- importance(rfmod_no_out, type = 1)
rnames <- rownames(var_imp)
values <- var_imp[, 1]
names(values) <- NULL
plot_ly(as.data.frame(var_imp), x = ~rownames(as.data.frame(var_imp)),
        y = ~MeanDecreaseAccuracy, type = "bar") %>%
  layout(xaxis = list(title = "Variable"))
```

### Variable Importance Plot (w/o outliers) - Mean Decr in Accuracy

```{r}
varImpPlot(rfmod_no_out, type = 1)
```

Column
-----------------------------------------------------------------------

### Variable Importance (w/o outliers) - Mean Decr in Node Impurity

```{r rf_importance_no_out_type_two}
var_imp <- importance(rfmod_no_out, type = 2)
rnames <- rownames(var_imp)
values <- var_imp[, 1]
names(values) <- NULL
plot_ly(as.data.frame(var_imp), x = ~rownames(as.data.frame(var_imp)),
        y = ~MeanDecreaseGini, type = "bar") %>%
  layout(xaxis = list(title = "Variable"))
```

### Variable Importance Plot (w/o outliers) - Mean Decr in Node Impurity

```{r}
varImpPlot(rfmod_no_out, type = 2)

# clean memory
rm(var_imp, rnames, values)
```

Observations {data-navmenu="Random Forest"}
=======================================================================

### Observations

* To compare bar plots, I chose to focus on predictor variables with a Mean Decrease in Accuracy $\geq 5$ for identifying important variables and to find
a potentially more sparse model.
* Based on the selected cut-off above, we will focus on the following as the most important since they agree for both random forest models, regardless of whether the training data contains outliers:
    + $age$
    + $environmentsatisfaction$
    + $joblevel$
    + $jobrole$
    + $maritalstatus$
    + $monthlyincome$
    + $overtime$
    + $stockoptionlevel$
    + $totalworkingyears$
    + $yearsatcompany$
* Recall the variables noted as having correlations (collinearity) from the **_Initial Observations/Notes_** section. Although randomForest can handle data with collinearities, we see here that several correlated variables were given high importance. In particular, consider the correlated relationships among $age$, $joblevel$, $monthlyincome$, $totalworkingyears$, and $yearsatcompany$.
* Let's rerun an RF model on data that does not have the following variables:
    + $totalworkingyears$
        + because a company may or may not know this info
        + we're assuming that the total number of years a person has been working is irrelevant regarding attrition (i.e., you can quit at any time, get another offer at any time, or be fired at any time, none of which necessarily depends on how long you've been in the workforce)
    + $yearsatcompany$
        + this is more of an 'umbrella' measure that can overlap with other variables it's correlated with (ex.
$yearswithcurrmanager$)
        + correlated with $monthlyincome$
    + $monthlyincome$
        + it's correlated with $age$ and $joblevel$
        + it's reasonable to expect that $monthlyincome$ will be greater with a higher $age$ and/or $joblevel$

```{r rf_data_repeat}
# NOTE: you must FACTOR the response variable here so that the RF model can
# classify each instance as 0 or 1
xtrain <- train[, -c(2, 17, 25, 28)]
ytrain <- factor(train$attrition)
xtrain_no_out <- train_minus_outliers[, -c(2, 17, 25, 28)]
ytrain_no_out <- factor(train_minus_outliers$attrition)
```

```{r rf_model_repeat}
rfmod <- randomForest(x = xtrain, y = ytrain, ntree = 1000, importance = T)
rfmod_no_out <- randomForest(x = xtrain_no_out, y = ytrain_no_out,
                             ntree = 1000, importance = T)
```

Variable Importance 2 (w/ Outliers) {data-navmenu="Random Forest"}
=======================================================================

Column
-----------------------------------------------------------------------

### Variable Importance (w/ outliers) - Mean Decr in Accuracy

```{r rf_importance_type_one_p2}
var_imp <- importance(rfmod, type = 1)
library(plotly)
rnames <- rownames(var_imp)
values <- var_imp[, 1]
names(values) <- NULL
plot_ly(as.data.frame(var_imp), x = ~rownames(as.data.frame(var_imp)),
        y = ~MeanDecreaseAccuracy, type = "bar") %>%
  layout(xaxis = list(title = "Variable"))
```

### Variable Importance Plot (w/ outliers) - Mean Decr in Accuracy

```{r}
varImpPlot(rfmod, type = 1)
```

Column
-----------------------------------------------------------------------

### Variable Importance (w/ outliers) - Mean Decr in Node Impurity

```{r rf_importance_type_two_p2}
var_imp <- importance(rfmod, type = 2)
rnames <- rownames(var_imp)
values <- var_imp[, 1]
names(values) <- NULL
plot_ly(as.data.frame(var_imp), x = ~rownames(as.data.frame(var_imp)),
        y = ~MeanDecreaseGini, type = "bar") %>%
  layout(xaxis = list(title = "Variable"))
```

### Variable Importance Plot (w/ outliers) - Mean Decr in Node Impurity

```{r}
varImpPlot(rfmod, type = 2)
```

Variable Importance 2 (w/o Outliers) {data-navmenu="Random Forest"}
=======================================================================

Column
-----------------------------------------------------------------------

### Variable Importance (w/o outliers) - Mean Decr in Accuracy

```{r rf_importance_no_out_type_one_p2}
var_imp <- importance(rfmod_no_out, type = 1)
rnames <- rownames(var_imp)
values <- var_imp[, 1]
names(values) <- NULL
plot_ly(as.data.frame(var_imp), x = ~rownames(as.data.frame(var_imp)),
        y = ~MeanDecreaseAccuracy, type = "bar") %>%
  layout(xaxis = list(title = "Variable"))
```

### Variable Importance Plot (w/o outliers) - Mean Decr in Accuracy

```{r}
varImpPlot(rfmod_no_out, type = 1)
```

Column
-----------------------------------------------------------------------

### Variable Importance (w/o outliers) - Mean Decr in Node Impurity

```{r rf_importance_no_out_type_two_p2}
var_imp <- importance(rfmod_no_out, type = 2)
rnames <- rownames(var_imp)
values <- var_imp[, 1]
names(values) <- NULL
plot_ly(as.data.frame(var_imp), x = ~rownames(as.data.frame(var_imp)),
        y = ~MeanDecreaseGini, type = "bar") %>%
  layout(xaxis = list(title = "Variable"))
```

### Variable Importance Plot (w/o outliers) - Mean Decr in Node Impurity

```{r}
varImpPlot(rfmod_no_out, type = 2)

# clean memory
rm(var_imp, rnames, values)
```

Observations 2 {data-navmenu="Random Forest"}
=======================================================================

### Observations

* Again, to compare bar plots, I chose to focus on predictor variables with a Mean Decrease in Accuracy $\geq 5$ for identifying important variables and to find a potentially more sparse model.
* Based on the selected cut-off above, we will focus on the following variables as the most important by Mean Decrease in Accuracy; each appeared both in the RF model trained on data with outliers and in the RF model trained on data without outliers:
    + $age$
    + $educationfield$
    + $environmentsatisfaction$
    + $jobinvolvement$
    + $joblevel$
    + $jobrole$
    + $maritalstatus$
    + $numcompaniesworked$
    + $overtime$
    + $stockoptionlevel$
    + $yearsincurrentrole$
* Now, let's develop a sparse random forest model that uses only the 11 most important variables we just identified.

RF Reduced Model {data-navmenu="Random Forest"}
=======================================================================

Column
-----------------------------------------------------------------------

```{r rf_data_reduced}
xtrain_red <- train[, c(1, 8, 9, 12, 13, 14, 16, 19, 20, 24, 29)]
ytrain_red <- factor(train$attrition)
xtrain_no_out_red <- train_minus_outliers[, c(1, 8, 9, 12, 13, 14, 16, 19,
                                              20, 24, 29)]
ytrain_no_out_red <- factor(train_minus_outliers$attrition)
```

```{r rf_model_reduced}
# NOTE: fit the reduced models on the reduced predictor sets (xtrain_red,
# xtrain_no_out_red), not on the full xtrain/xtrain_no_out data
rfmod_red <- randomForest(x = xtrain_red, y = ytrain_red, ntree = 1000,
                          importance = T)
rfmod_no_out_red <- randomForest(x = xtrain_no_out_red,
                                 y = ytrain_no_out_red, ntree = 1000,
                                 importance = T)
```

### RF Reduced Model Plot (w/ outliers)

```{r rfmod_plot_reduced}
plot(rfmod_red)
```

### RF Reduced Model Plot (w/o outliers)

```{r rfmod_plot_no_out_reduced}
plot(rfmod_no_out_red)

# clean memory
rm(xtrain, xtrain_no_out, xtrain_no_out_red, xtrain_red)
rm(ytrain, ytrain_no_out, ytrain_no_out_red, ytrain_red)
```

Model Performance {data-navmenu="Performance"}
=======================================================================

```{r mod_perform_logistic_final}
library(AUC)

# make predictions ####
pred_logistic_final_test <- predict(final_logistic_mod,
                                    newdata = test[, c(1, 3:31)],
                                    type = "response")

# roc ####
roc_logistic_final_test <- roc(pred_logistic_final_test,
                               factor(test$attrition))

# AUC calcs ####
auc_logistic_final_test <- AUC::auc(roc_logistic_final_test) # classification rates #### fitted_probs_logistic_test <- ifelse(pred_logistic_final_test > 0.5, 1, 0) observed_ytest <- test$attrition # this can be used for all performance calcs below correct_classify_logistic_test <- round(mean(fitted_probs_logistic_test == observed_ytest), 5) misclassify_logistic_test <- round(mean(fitted_probs_logistic_test != observed_ytest), 5) ``` ```{r mod_perform_SLR} # create copies of the data sets with the relevant predictors #### xtest <- test[, -c(2, 5, 10, 13, 14, 17, 23, 27, 28)] xtest$age_sq <- xtest$age^2 xtest <- xtest[, c(1, 23, 2:22)] xtest <- data.matrix(xtest) ytest <- test$attrition # make predictions #### pred_SLR_test_min <- predict(glmnetFit, newx = xtest, s = "lambda.min", type = "response") pred_SLR_test_1se <- predict(glmnetFit, newx = xtest, s = "lambda.1se", type = "response") # roc #### roc_SLR_test_min <- roc(pred_SLR_test_min, factor(ytest)) roc_SLR_test_1se <- roc(pred_SLR_test_1se, factor(ytest)) # AUC calcs #### auc_SLR_test_min <- AUC::auc(roc_SLR_test_min) auc_SLR_test_1se <- AUC::auc(roc_SLR_test_1se) # classification rates #### fitted_probs_SLR_test_min <- ifelse(pred_SLR_test_min > 0.5, 1, 0) fitted_probs_SLR_test_1se <- ifelse(pred_SLR_test_1se > 0.5, 1, 0) correct_classify_SLR_test_min <- round(mean(fitted_probs_SLR_test_min == observed_ytest), 5) correct_classify_SLR_test_1se <- round(mean(fitted_probs_SLR_test_1se == observed_ytest), 5) misclassify_SLR_test_min <- round(mean(fitted_probs_SLR_test_min != observed_ytest), 5) misclassify_SLR_test_1se <- round(mean(fitted_probs_SLR_test_1se != observed_ytest), 5) # clean memory rm(xtest, ytest) ``` ```{r mod_perform_RF} # make predictions #### pred_RF_full_with_outliers_test <- predict(rfmod, newdata = test[, c(1, 3:31)], type = "prob")[, 2] pred_RF_full_no_outliers_test <- predict(rfmod_no_out, newdata = test[, c(1, 3:31)], type = "prob")[, 2] pred_RF_reduced_with_outliers_test <- 
predict(rfmod_red, newdata = test[, c(1, 3:31)], type = "prob")[, 2] pred_RF_reduced_no_outliers_test <- predict(rfmod_no_out_red, newdata = test[, c(1, 3:31)], type = "prob")[, 2] # roc #### roc_RF_full_with_outliers_test <- roc(pred_RF_full_with_outliers_test, factor(test$attrition)) roc_RF_full_no_outliers_test <- roc(pred_RF_full_no_outliers_test, factor(test$attrition)) roc_RF_reduced_with_outliers_test <- roc(pred_RF_reduced_with_outliers_test, factor(test$attrition)) roc_RF_reduced_no_outliers_test <- roc(pred_RF_reduced_no_outliers_test, factor(test$attrition)) # AUC calcs #### auc_RF_full_with_outliers_test <- AUC::auc(roc_RF_full_with_outliers_test) auc_RF_full_no_outliers_test <- AUC::auc(roc_RF_full_no_outliers_test) auc_RF_reduced_with_outliers_test <- AUC::auc(roc_RF_reduced_with_outliers_test) auc_RF_reduced_no_outliers_test <- AUC::auc(roc_RF_reduced_no_outliers_test) # classification rates #### # NOTE: we can use the same observed_yval and observed_ytest variables from the # mod_perform_logistic_final code chunk fitted_probs_RF_full_with_outliers_test <- ifelse(pred_RF_full_with_outliers_test > 0.5, 1, 0) fitted_probs_RF_full_no_outliers_test <- ifelse(pred_RF_full_no_outliers_test > 0.5, 1, 0) fitted_probs_RF_reduced_with_outliers_test <- ifelse(pred_RF_reduced_with_outliers_test > 0.5, 1, 0) fitted_probs_RF_reduced_no_outliers_test <- ifelse(pred_RF_reduced_no_outliers_test > 0.5, 1, 0) correct_classify_RF_full_with_outliers_test <- round(mean(fitted_probs_RF_full_with_outliers_test == observed_ytest), 5) correct_classify_RF_full_no_outliers_test <- round(mean(fitted_probs_RF_full_no_outliers_test == observed_ytest), 5) correct_classify_RF_reduced_with_outliers_test <- round(mean(fitted_probs_RF_reduced_with_outliers_test == observed_ytest), 5) correct_classify_RF_reduced_no_outliers_test <- round(mean(fitted_probs_RF_reduced_no_outliers_test == observed_ytest), 5) misclassify_RF_full_with_outliers_test <- 
round(mean(fitted_probs_RF_full_with_outliers_test != observed_ytest), 5) misclassify_RF_full_no_outliers_test <- round(mean(fitted_probs_RF_full_no_outliers_test != observed_ytest), 5) misclassify_RF_reduced_with_outliers_test <- round(mean(fitted_probs_RF_reduced_with_outliers_test != observed_ytest), 5) misclassify_RF_reduced_no_outliers_test <- round(mean(fitted_probs_RF_reduced_no_outliers_test != observed_ytest), 5) ``` ```{r perform_results} # clean memory rm(observed_ytest) # create vectors auc_test_vec <- c(auc_logistic_final_test, auc_SLR_test_min, auc_SLR_test_1se, auc_RF_full_with_outliers_test, auc_RF_full_no_outliers_test, auc_RF_reduced_with_outliers_test, auc_RF_reduced_no_outliers_test) # clean memory rm(auc_logistic_final_test, auc_SLR_test_min, auc_SLR_test_1se, auc_RF_full_with_outliers_test, auc_RF_full_no_outliers_test, auc_RF_reduced_with_outliers_test, auc_RF_reduced_no_outliers_test) correct_classify_test_vec <- c(correct_classify_logistic_test, correct_classify_SLR_test_min, correct_classify_SLR_test_1se, correct_classify_RF_full_with_outliers_test, correct_classify_RF_full_no_outliers_test, correct_classify_RF_reduced_with_outliers_test, correct_classify_RF_reduced_no_outliers_test) # clean memory rm(correct_classify_logistic_test, correct_classify_SLR_test_min, correct_classify_SLR_test_1se, correct_classify_RF_full_with_outliers_test, correct_classify_RF_full_no_outliers_test, correct_classify_RF_reduced_with_outliers_test, correct_classify_RF_reduced_no_outliers_test) misclassify_test_vec <- c(misclassify_logistic_test, misclassify_SLR_test_min, misclassify_SLR_test_1se, misclassify_RF_full_with_outliers_test, misclassify_RF_full_no_outliers_test, misclassify_RF_reduced_with_outliers_test, misclassify_RF_reduced_no_outliers_test) # clean memory rm(misclassify_logistic_test, misclassify_SLR_test_min, misclassify_SLR_test_1se, misclassify_RF_full_with_outliers_test, misclassify_RF_full_no_outliers_test, 
misclassify_RF_reduced_with_outliers_test, misclassify_RF_reduced_no_outliers_test) # build data frames performance_test_df <- data.frame(cbind(auc_test_vec, correct_classify_test_vec, misclassify_test_vec), stringsAsFactors = F) colnames(performance_test_df) <- c("AUC", "Correct Classification Rate", "Misclassification Rate") rownames(performance_test_df) <- c("Logistic Regr", "SLR (lambda.min)", "SLR (lambda.1se)", "RF (Saturated Model)*", "RF (Saturated Model)", "RF (Reduced Model)*", "RF (Reduced Model)") # clean memory rm(auc_test_vec, correct_classify_test_vec, misclassify_test_vec) ``` Column {data-width=400} ----------------------------------------------------------------------- ### Performance - test set ```{r test_results} performance_test_df %>% rownames_to_column("var") %>% mutate( `AUC` = cell_spec(`AUC`, "html", color = ifelse(`AUC` == max(`AUC`), "red", "black")) ) %>% mutate( `Correct Classification Rate` = cell_spec(`Correct Classification Rate`, "html", color = ifelse(`Correct Classification Rate` == max(`Correct Classification Rate`), "red", "black")) ) %>% column_to_rownames("var") %>% kable(format = "html", escape = F) %>% kable_styling(bootstrap_options = "striped", full_width = F) %>% footnote(symbol = c("Training set contained outliers during model building.")) # clean memory rm(performance_test_df) ``` Column {data-width=600} ----------------------------------------------------------------------- ### Receiver Operating Curves (ROCs) ```{r roc_curves} plot(roc_logistic_final_test, col = "black", main = "ROC Curves") plot(roc_SLR_test_min, col = "blue", add = T) plot(roc_SLR_test_1se, col = "green", add = T) plot(roc_RF_full_with_outliers_test, col = "brown", add = T) plot(roc_RF_full_no_outliers_test, col = "red", add = T) plot(roc_RF_reduced_with_outliers_test, col = "purple", add = T) plot(roc_RF_reduced_no_outliers_test, col = "orange", add = T) legend(0.5, 0.45, legend = c("Logistic Regr", "SLR (lasso) (lambda_min)", "SLR (lasso) 
(lambda_1se)", "RF_full_w/_outliers", "RF_full_w/o_outliers", "RF_reduced_w/_outliers", "RF_reduced_w/o_outliers"), col = c("black", "blue", "green", "brown", "red", "purple", "orange"), lty = 1, cex = 0.7, text.font = 1) ``` Predictions {data-navmenu="Performance"} ======================================================================= ### Predicted Probabilities & Class ```{r predictions_table} df <- data.frame(cbind(1:nrow(test), test$attrition, pred_logistic_final_test, fitted_probs_logistic_test, pred_SLR_test_min, fitted_probs_SLR_test_min, pred_SLR_test_1se, fitted_probs_SLR_test_1se, pred_RF_full_with_outliers_test, fitted_probs_RF_full_with_outliers_test, pred_RF_full_no_outliers_test, fitted_probs_RF_full_no_outliers_test, pred_RF_reduced_with_outliers_test, fitted_probs_RF_reduced_with_outliers_test, pred_RF_reduced_no_outliers_test, fitted_probs_RF_reduced_no_outliers_test)) colnames(df) <- c("test_data_index", "attrition", "Logistic Regr Prob", "Logistic Regr Class", "SLR (lambda_min) Prob", "SLR (lambda_min) Class", "SLR (lambda_1se) Prob", "SLR (lambda_1se) Class", "RF_full_w/_outliers Prob", "RF_full_w/_outliers Class", "RF_full_w/o_outliers Prob", "RF_full_w/o_outliers Class", "RF_reduced_w/_outliers Prob", "RF_reduced_w/_outliers Class", "RF_reduced_w/o_outliers Prob", "RF_reduced_w/o_outliers Class") df %>% rownames_to_column("var") %>% mutate( `Logistic Regr Class` = cell_spec(`Logistic Regr Class`, "html", background = ifelse(`Logistic Regr Class` != `attrition`, "yellow", "white")) ) %>% mutate( `SLR (lambda_min) Class` = cell_spec(`SLR (lambda_min) Class`, "html", background = ifelse(`SLR (lambda_min) Class` != `attrition`, "yellow", "white")) ) %>% mutate( `SLR (lambda_1se) Class` = cell_spec(`SLR (lambda_1se) Class`, "html", background = ifelse(`SLR (lambda_1se) Class` != `attrition`, "yellow", "white")) ) %>% mutate( `RF_full_w/_outliers Class` = cell_spec(`RF_full_w/_outliers Class`, "html", background = ifelse(`RF_full_w/_outliers Class` 
!= `attrition`, "yellow", "white")) ) %>% mutate( `RF_full_w/o_outliers Class` = cell_spec(`RF_full_w/o_outliers Class`, "html", background = ifelse(`RF_full_w/o_outliers Class` != `attrition`, "yellow", "white")) ) %>% mutate( `RF_reduced_w/_outliers Class` = cell_spec(`RF_reduced_w/_outliers Class`, "html", background = ifelse(`RF_reduced_w/_outliers Class` != `attrition`, "yellow", "white")) ) %>% mutate( `RF_reduced_w/o_outliers Class` = cell_spec(`RF_reduced_w/o_outliers Class`, "html", background = ifelse(`RF_reduced_w/o_outliers Class` != `attrition`, "yellow", "white")) ) %>% column_to_rownames("var") %>% kable(format = "html", escape = F) %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left") #clean memory rm(df) ``` Differences {data-navmenu="Performance"} ======================================================================= Column{.tabset .tabset-fade} ----------------------------------------------------------------------- ### Number misclassified ```{r different_preds} index_misclass_logsitic <- which(fitted_probs_logistic_test != test$attrition) index_misclass_SLR_test_min <- which(fitted_probs_SLR_test_min != test$attrition) index_misclass_SLR_test_1se <- which(fitted_probs_SLR_test_1se != test$attrition) index_misclass_RF_full_with_outliers_test <- which(fitted_probs_RF_full_with_outliers_test != test$attrition) index_misclass_RF_full_no_outliers_test <- which(fitted_probs_RF_full_no_outliers_test != test$attrition) index_misclass_RF_reduced_with_outliers_test <- which(fitted_probs_RF_reduced_with_outliers_test != test$attrition) index_misclass_RF_reduced_no_outliers_test <- which(fitted_probs_RF_reduced_no_outliers_test != test$attrition) # misclassify rollup cat("# of misclassified instances - Logistic Regr:", length(index_misclass_logsitic), sep = " ") cat("# of misclassified instances - SLR(lambda_min):", length(index_misclass_SLR_test_min), sep = " ") cat("# of misclassified instances - SLR(lambda_1se):", 
length(index_misclass_SLR_test_1se), sep = " ")
cat("# of misclassified instances - RF_full_w/_outliers:",
    length(index_misclass_RF_full_with_outliers_test), sep = " ")
cat("# of misclassified instances - RF_full_w/o_outliers:",
    length(index_misclass_RF_full_no_outliers_test), sep = " ")
cat("# of misclassified instances - RF_reduced_w/_outliers:",
    length(index_misclass_RF_reduced_with_outliers_test), sep = " ")
cat("# of misclassified instances - RF_reduced_w/o_outliers:",
    length(index_misclass_RF_reduced_no_outliers_test), sep = " ")

# clean memory
rm(fitted_probs_logistic_test, fitted_probs_SLR_test_min,
   fitted_probs_SLR_test_1se, fitted_probs_RF_full_with_outliers_test,
   fitted_probs_RF_full_no_outliers_test,
   fitted_probs_RF_reduced_with_outliers_test,
   fitted_probs_RF_reduced_no_outliers_test)
rm(pred_logistic_final_test, pred_SLR_test_min, pred_SLR_test_1se,
   pred_RF_full_with_outliers_test, pred_RF_full_no_outliers_test,
   pred_RF_reduced_with_outliers_test, pred_RF_reduced_no_outliers_test)
```

### Misclass Agreements (continuous)

```{r misclass_agree_cont, fig.width=8}
# misclassifications that agree across ALL models
temp <- Reduce(intersect, list(index_misclass_logsitic,
                               index_misclass_SLR_test_min,
                               index_misclass_SLR_test_1se,
                               index_misclass_RF_full_with_outliers_test,
                               index_misclass_RF_full_no_outliers_test,
                               index_misclass_RF_reduced_with_outliers_test,
                               index_misclass_RF_reduced_no_outliers_test))
df <- data.frame(cbind("index" = temp, test[temp, ]))
plot_histogram(df, nrow = 3, ncol = 3)
```

### Misclass Agreements (discrete)

```{r misclass_agree_discrete, fig.width=8}
plot_bar(df, nrow = 3, ncol = 3)
```

### Misclassified (All models agree)

* This is a table of the test data instances that were misclassified. These specific instances were misclassified across all models applied.

```{r}
cat("# of the SAME misclassified instances that occur across ALL models:",
    length(temp), sep = " ")
```
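The chunk above collapses the per-model misclassification indices with `Reduce(intersect, ...)`. As a quick sanity check of that idiom, here is a minimal sketch on toy index vectors (the `idx_mod_*` names are hypothetical stand-ins, not objects from this dashboard):

```{r reduce_intersect_sketch}
# Reduce() folds intersect() over the list pairwise, keeping only the
# indices misclassified by every model (toy data for illustration)
idx_mod_a <- c(3, 7, 12, 20, 31)
idx_mod_b <- c(7, 12, 15, 31)
idx_mod_c <- c(1, 7, 31, 40)

shared <- Reduce(intersect, list(idx_mod_a, idx_mod_b, idx_mod_c))
shared  # 7 31
```

The same pattern scales to any number of models: add more index vectors to the list and `Reduce()` handles the pairwise folding.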
```{r misclass_data}
df <- data.frame(cbind("index" = temp, test[temp, ]))
df %>%
  kable(format = "html", escape = F) %>%
  kable_styling(bootstrap_options = "striped", full_width = F,
                position = "left")

# clean memory
rm(df, temp)
```

Observations & Notes {data-navmenu="Performance"}
=======================================================================

### Observations & Notes

* The reduced Random Forest model on data without outliers has the best AUC on the test set and the best classification rate among the random forest models.
* The logistic regression model built on data without outliers has the best correct classification rate on the test set.
* All random forest models had better AUC values than the logistic regression model, but each also had a lower correct classification rate than the logistic regression model. However, the differences were small (classification rate differences between approximately (0.009, 0.011) and AUC differences between approximately (0.0023, 0.0091)).
* After some reading online, disagreements between AUC and correct classification rate may occur because of unbalanced data sets and/or a classification threshold of 0.5 (which was used in this project). To troubleshoot these issues, further examination is needed of the ROC curves, the threshold values used, possibly the predicted probabilities, and/or other performance measures (e.g., sensitivity, specificity). I suspect the most likely reason no single model has both the best AUC and the highest classification rate is that the data set is unbalanced.
* Upon inspecting the ROC curves, we note that the region of greatest disagreement between the logistic regression model and the random forest models occurs where $1-specificity$ is between $(0.25, 0.5)$.
* Given the previous work, if the objective is predicting attrition then applying the reduced/sparse random forest model on data that does not contain outliers and/or correlated predictor variables appears to be the best method to use (of those explored here).
* For the purpose of this project/exercise, I want to explore how various predictor variables affect the odds or probability of $attrition = 1 (yes)$. To do so, I am choosing the **_logistic regression model_** to gain further insights from the data. This model is chosen because:
    + it's more easily interpreted for inference purposes
    + it has the best correct classification rate
    + its AUC is within 0.01 of the other models' AUCs

Chosen Model {data-navmenu="Insights"}
=======================================================================

```{r clean_memory}
rm(glmnetFit, rfmod, rfmod_no_out, rfmod_red, rfmod_no_out_red,
   train_minus_outliers, test)
rm(roc_logistic_final_test, roc_RF_full_no_outliers_test,
   roc_RF_full_with_outliers_test, roc_RF_reduced_no_outliers_test,
   roc_RF_reduced_with_outliers_test, roc_SLR_test_1se, roc_SLR_test_min)
```

### Chosen model

\begin{align}
logit[P(attrition = 1 (Yes))] = &11.86865 - 0.36314age + 0.00426age^2 + \beta_2businesstravel \\
&- 0.00064dailyrate + 0.06069distancefromhome + \\
&\beta_7educationfield - 0.90788environmentsatisfaction \\
&- 1.19419jobinvolvement - 0.42537jobsatisfaction + \\
&\beta_{15}maritalstatus + 0.29720numcompaniesworked + \\
&\beta_{19}overtime - 0.20686totalworkingyears \\
&- 0.29303trainingtimeslastyear \\
&- 0.20462yearsincurrentrole + \\
&0.26748yearssincelastpromotion
\end{align}

where

\[\beta_2businesstravel =
\begin{cases}
0, \quad for \enspace businesstravel = non-travel(1)\\
& \\
2.44736, \quad for \enspace businesstravel = travel \enspace rarely(2)\\
& \\
4.06929, \quad for \enspace businesstravel = travel \enspace frequently(3)
\end{cases}
\]
\[\beta_7educationfield = \begin{cases} 0, \quad for \enspace educationfield = human \enspace resources(1)\\ & \\ -1.76682, \quad for \enspace educationfield = life \enspace sciences(2)\\ & \\ -0.53888, \quad for \enspace educationfield = marketing(3)\\ & \\ -2.07068, \quad for \enspace educationfield = medical(4)\\ & \\ -3.11794, \quad for \enspace educationfield = other(5)\\ & \\ -0.13154, \quad for \enspace educationfield = technical \enspace degree(6)\\ \end{cases} \]
\[\beta_{15}maritalstatus = \begin{cases} 0, \quad for \enspace maritalstatus = single(1)\\ & \\ -1.60791, \quad for \enspace maritalstatus = married(2)\\ & \\ -1.68362, \quad for \enspace maritalstatus = divorced(3)\\ \end{cases} \]
\[\beta_{19}overtime =
\begin{cases}
0, \quad for \enspace overtime = no(1)\\
& \\
3.13669, \quad for \enspace overtime = yes(2)\\
\end{cases}
\]

Searchable Data Table {data-navmenu="Insights"}
=======================================================================

```{r add_pred_probs_and_class}
# add column for the estimated predicted prob (stored as a percentage, 0-100)
temp <- predict(final_logistic_mod, newdata = data_copy, type = "response")
data_copy_w_est_prob_of_attrit <- data_copy
data_copy_w_est_prob_of_attrit$est_prob_attrit <- round(temp, 4) * 100
data_copy_w_est_prob_of_attrit <-
  data_copy_w_est_prob_of_attrit[, c(1, 2, 32, 3:31)]

# add indicator column for outliers
data_copy_w_est_prob_of_attrit$outlier <-
  rep(0, nrow(data_copy_w_est_prob_of_attrit))
data_copy_w_est_prob_of_attrit$outlier[outliers] <- "yes"
data_copy_w_est_prob_of_attrit$outlier[grep(0,
  data_copy_w_est_prob_of_attrit$outlier)] <- "no"

# add indicator column for high leverage instances
data_copy_w_est_prob_of_attrit$highlev <-
  rep(0, nrow(data_copy_w_est_prob_of_attrit))
data_copy_w_est_prob_of_attrit$highlev[highlevs] <- "yes"
data_copy_w_est_prob_of_attrit$highlev[grep(0,
  data_copy_w_est_prob_of_attrit$highlev)] <- "no"

# reorder df
data_copy_w_est_prob_of_attrit <-
  data_copy_w_est_prob_of_attrit[, c(33, 34, 1:32)]

# add predicted class
# NOTE: est_prob_attrit is on a 0-100 scale, so the 0.5 probability cut-off
# corresponds to 50 here
data_copy_w_est_prob_of_attrit$pred_class <-
  ifelse(data_copy_w_est_prob_of_attrit$est_prob_attrit > 50, "yes", "no")
data_copy_w_est_prob_of_attrit <-
  data_copy_w_est_prob_of_attrit[, c(1:4, 35, 5:34)]

# clean memory
rm(temp)
```

```{r searchable_table}
DT::datatable(data_copy_w_est_prob_of_attrit, filter = "top",
              extensions = "Buttons",
              options = list(autoWidth = T, pageLength = 10,
                             autoHideNavigation = F, dom = "Bfrtip",
                             buttons = c("copy", "csv", "print"),
                             searchHighlight = T
                             )
              )
```

Interpretations {data-navmenu="Insights"}
=======================================================================

Column
-----------------------------------------------------------------------

### 
Continuous/numerical variables (scrollable)

* For every one-year increase in $age$, the estimated odds of $attrition = Yes$ change by a multiplicative factor of $e^{-0.36314} = 0.6955$, a 30.5% decrease in the odds of $attrition = Yes$, holding all other variables fixed. However, because the model also includes a quadratic (squared) term for $age$, the effect of $age$ is not a constant linear slope: the slope changes with each additional year of age. The effect of age on the estimated odds of $attrition = Yes$ initially decreases, is minimized at age 43, and increases after age 43, holding all other variables fixed. See the Age Effect plot in the adjacent column.

* For every one-dollar increase in $dailyrate$, the estimated odds of $attrition = Yes$ change by a multiplicative factor of $e^{-0.00064} = 0.9994$, a 0.06% decrease in the odds of $attrition = Yes$, holding all other variables fixed.

* For every 1-mile increase in $distancefromhome$, the estimated odds of $attrition = Yes$ increase by a multiplicative factor of $e^{0.06069} = 1.0626$, a 6.26% increase in the odds of $attrition = Yes$, holding all other variables fixed.

* For every 1-unit increase in $numcompaniesworked$, the estimated odds of $attrition = Yes$ increase by a multiplicative factor of $e^{0.29720} = 1.3461$, a 34.6% increase in the odds of $attrition = Yes$, holding all other variables fixed.

* For every 1-unit increase in $trainingtimeslastyear$, the estimated odds of $attrition = Yes$ change by a multiplicative factor of $e^{-0.29303} = 0.7460$, a 25.4% decrease in the odds of $attrition = Yes$, holding all other variables fixed. 
* For every 1-year increase in $yearsincurrentrole$, the estimated odds of $attrition = Yes$ change by a multiplicative factor of $e^{-0.20462} = 0.8150$, an 18.5% decrease in the odds of $attrition = Yes$, holding all other variables fixed.

* For every 1-year increase in $yearssincelastpromotion$, the estimated odds of $attrition = Yes$ increase by a multiplicative factor of $e^{0.26748} = 1.3067$, a 30.7% increase in the odds of $attrition = Yes$, holding all other variables fixed.

### Categorical variables (scrollable)

* For every 1-unit increase in the $environmentsatisfaction$ rating, the estimated odds of $attrition = Yes$ change by a multiplicative factor of $e^{-0.90788} = 0.4034$, a 59.7% decrease in the odds of $attrition = Yes$ from the previous rating level, holding all other variables fixed.

* For every 1-unit increase in the $jobinvolvement$ rating, the estimated odds of $attrition = Yes$ change by a multiplicative factor of $e^{-1.19419} = 0.3029$, a 69.7% decrease in the odds of $attrition = Yes$ from the previous rating level, holding all other variables fixed.

* For every 1-unit increase in the $jobsatisfaction$ rating, the estimated odds of $attrition = Yes$ change by a multiplicative factor of $e^{-0.42537} = 0.6535$, a 34.7% decrease in the odds of $attrition = Yes$ from the previous rating level, holding all other variables fixed.

* For $businesstravel = rarely travel$, the estimated odds of $attrition = Yes$ are $e^{2.44736} = 11.5578$ times the estimated odds for $businesstravel = non-travel$. The estimated odds are 1056% greater for the $businesstravel = rarely travel$ group.

* For $businesstravel = travel frequently$, the estimated odds of $attrition = Yes$ are $e^{4.06929} = 58.5154$ times the estimated odds for $businesstravel = non-travel$. The estimated odds are 5752% greater for the $businesstravel = travel frequently$ group. 
* For $businesstravel = travel frequently$, the estimated odds of $attrition = Yes$ are $e^{4.06929-2.44736} = 5.0629$ times the estimated odds for $businesstravel = rarely travel$. The estimated odds are 406% greater for the $businesstravel = travel frequently$ group.

* For $educationfield = life sciences$, the estimated odds of $attrition = Yes$ are $e^{-1.76682} = 0.1709$ times the estimated odds for $educationfield = human resources$. The estimated odds are 82.9% lower for the $educationfield = life sciences$ group.

* For $educationfield = marketing$, the estimated odds of $attrition = Yes$ are $e^{-0.53888} = 0.5834$ times the estimated odds for $educationfield = human resources$. The estimated odds are 41.7% lower for the $educationfield = marketing$ group.

* For $educationfield = medical$, the estimated odds of $attrition = Yes$ are $e^{-2.07068} = 0.1261$ times the estimated odds for $educationfield = human resources$. The estimated odds are 87.4% lower for the $educationfield = medical$ group.

* For $educationfield = other$, the estimated odds of $attrition = Yes$ are $e^{-3.11794} = 0.0442$ times the estimated odds for $educationfield = human resources$. The estimated odds are 95.6% lower for the $educationfield = other$ group.

* For $educationfield = technical degree$, the estimated odds of $attrition = Yes$ are $e^{-0.13154} = 0.8767$ times the estimated odds for $educationfield = human resources$. The estimated odds are 12.3% lower for the $educationfield = technical degree$ group.

* For $maritalstatus = married$, the estimated odds of $attrition = Yes$ are $e^{-1.60791} = 0.2003$ times the estimated odds for $maritalstatus = single$. The estimated odds are 80% lower for the $maritalstatus = married$ group.

* For $maritalstatus = divorced$, the estimated odds of $attrition = Yes$ are $e^{-1.68362} = 0.1857$ times the estimated odds for $maritalstatus = single$. The estimated odds are 81.4% lower for the $maritalstatus = divorced$ group. 
* For $overtime = yes$, the estimated odds of $attrition = Yes$ are $e^{3.13669} = 23.0275$ times the estimated odds for $overtime = no$. The estimated odds are 2203% greater for the $overtime = yes$ group.

Column
-----------------------------------------------------------------------

### Age Effect

```{r age_effects}
# relative odds of attrition as a function of age, using the linear and
# quadratic age coefficients from the fitted model
odds_fn <- function(x) {exp(-0.36314*x + 0.00426*x^2)}

age <- seq(min(data_copy_w_est_prob_of_attrit$age),
           max(data_copy_w_est_prob_of_attrit$age),
           by = 1)
odds <- odds_fn(age)
age_data <- data.frame(age, odds)

ggplot(age_data, aes(age, odds)) +
  geom_line() +
  geom_vline(xintercept = age_data$age[which.min(odds)], color = "steelblue")

# clean memory
rm(odds_fn, age, odds, age_data)
```

### Observations

**_Variables with no information value_**

* $employeecount$ - only one unique value; each "1" represents a single employee
* $over18$ - only one unique value; all employees are $\geq$ 18 years old
* $standardhours$ - only one unique value; each employee works a standard 80-hour schedule over a two-week period
* $employeenumber$ - an index used to identify each employee

**_Variables not impacting the model & its outcome_**

* Independence tests during logistic regression modeling indicated that the following variables had no relationship with the response variable. 
These variables were excluded from the model:
    + $gender$
    + $relationshipsatisfaction$
    + $worklifebalance$

**_Variables removed due to collinearity_**

* The following variables were removed during logistic regression modeling because they were highly correlated with other variables in the model:
    + $department$
    + $joblevel$
    + $jobrole$
    + $monthlyincome$
    + $yearsatcompany$

**_Categorical predictor combination contributing to the highest estimated $attrition = Yes$_**

* $businesstravel = travel frequently$
* $educationfield = human resources$
* $jobinvolvement = low$
* $jobsatisfaction = low$
* $maritalstatus = single$
* $overtime = yes$

**_Categorical predictor combination contributing to the lowest estimated $attrition = Yes$_**

* $businesstravel = non-travel$
* $educationfield = other$
* $jobinvolvement = very high$
* $jobsatisfaction = very high$
* $maritalstatus = divorced$
* $overtime = no$

**_Variables affecting attrition the most_**

* $numcompaniesworked$ and $yearssincelastpromotion$ each lead to an **increase** in the odds of attrition as the variable value increases
* $trainingtimeslastyear$, $yearsincurrentrole$, $environmentsatisfaction$, $jobinvolvement$, and $jobsatisfaction$ each lead to a notable **decrease** in the odds of attrition as the variable value increases

Post-modeling Exploration {data-navmenu="Insights"}
=======================================================================

### Attrition By Department & Gender

```{r}
# let's look at each department among employees who attrited (attrition == 1)
data_copy[data_copy$attrition == 1, ] %>%
  ggplot() +
  geom_bar(aes(gender, fill = businesstravel)) +
  facet_wrap(~ department) +
  ggtitle("Attrition = Yes") +
  labs(fill = "Business Travel")

data_copy[data_copy$attrition == 1, ] %>%
  ggplot() +
  geom_bar(aes(gender, fill = educationfield)) +
  facet_wrap(~ department) +
  ggtitle("Attrition = Yes") +
  labs(fill = "Education Field")

data_copy[data_copy$attrition == 1, ] %>% ggplot() + 
geom_bar(aes(gender, fill = factor(jobinvolvement))) + facet_wrap(~ department) + ggtitle("Attrition = Yes") + labs(fill = "Job Involvement") data_copy[data_copy$attrition == 1, ] %>% ggplot() + geom_bar(aes(gender, fill = factor(jobsatisfaction))) + facet_wrap(~ department) + ggtitle("Attrition = Yes") + labs(fill = "Job Satisfaction") data_copy[data_copy$attrition == 1, ] %>% ggplot() + geom_bar(aes(gender, fill = maritalstatus)) + facet_wrap(~ department) + ggtitle("Attrition = Yes") + labs(fill = "Marital Status") data_copy[data_copy$attrition == 1, ] %>% ggplot() + geom_bar(aes(gender, fill = overtime)) + facet_wrap(~ department) + ggtitle("Attrition = Yes") + labs(fill = "Overtime") data_copy[data_copy$attrition == 1, ] %>% ggplot() + geom_bar(aes(gender, fill = factor(numcompaniesworked))) + facet_wrap(~ department) + ggtitle("Attrition = Yes") + labs(fill = "# Companies Worked") data_copy[data_copy$attrition == 1, ] %>% ggplot() + geom_bar(aes(gender, fill = factor(trainingtimeslastyear))) + facet_wrap(~ department) + ggtitle("Attrition = Yes") + labs(fill = "Training Times Last Yr") data_copy[data_copy$attrition == 1, ] %>% ggplot() + geom_bar(aes(gender, fill = factor(yearsincurrentrole))) + facet_wrap(~ department) + ggtitle("Attrition = Yes") + labs(fill = "Yrs in Current Role") data_copy[data_copy$attrition == 1, ] %>% ggplot() + geom_bar(aes(gender, fill = factor(yearssincelastpromotion))) + facet_wrap(~ department) + ggtitle("Attrition = Yes") + labs(fill = "Yrs Since Last Promotion") data_copy[data_copy$attrition == 1, ] %>% ggplot() + geom_bar(aes(gender, fill = factor(environmentsatisfaction))) + facet_wrap(~ department) + ggtitle("Attrition = Yes") + labs(fill = "Environment Satisfaction") data_copy[data_copy$attrition == 1, ] %>% ggplot() + geom_bar(aes(gender, fill = factor(distancefromhome))) + facet_wrap(~ department) + ggtitle("Attrition = Yes") + labs(fill = "Distance From Home") data_copy[data_copy$attrition == 1, ] %>% ggplot() + 
geom_bar(aes(gender, fill = factor(age))) +
  facet_wrap(~ department) +
  ggtitle("Attrition = Yes") +
  labs(fill = "Age")
```

Findings & recommendations {data-navmenu="Insights"}
=======================================================================

Column
-----------------------------------------------------------------------

### Major Findings

* Most of the employees who attrited were in the Research and Development department, followed by the Sales department.

* Listed in descending order of effect size, $numcompaniesworked$ and $yearssincelastpromotion$ individually have the greatest **increasing** effect on the odds of attrition

* Listed in descending order of effect size, $jobinvolvement$, $environmentsatisfaction$, $jobsatisfaction$, $trainingtimeslastyear$, and $yearsincurrentrole$ individually have the greatest **decreasing** effect on the odds of attrition

* Individually, the effect of $age$ **decreases** the odds of attrition each year from ages 18-42. At age 43 the effect of $age$ is **_minimized_**. Beginning at age 44, the effect of $age$ **increases** the odds of attrition.

* The odds of attrition are substantially **lower** for married or divorced employees than for single employees.

* The odds of attrition are **lower** for each $educationfield$ category compared to an education field of human resources.

* The odds of attrition are **greater** for both frequent and rare business travelers compared to non-travelers. 
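The age-43 turning point noted above follows directly from the model's quadratic term. This is a minimal sketch (the chunk name is arbitrary) using the two $age$ coefficients quoted in the interpretations:

```{r age_effect_vertex}
# the age effect beta1*age + beta2*age^2 is minimized at the parabola's
# vertex, age = -beta1 / (2 * beta2)
beta_age  <- -0.36314   # linear coefficient for age
beta_age2 <-  0.00426   # quadratic coefficient for age
min_age   <- -beta_age / (2 * beta_age2)
round(min_age, 1)  # ~42.6, i.e. the effect bottoms out around age 43
```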
Column
-----------------------------------------------------------------------

### Recommendations

* Focus on the R&D department first, the Sales department second
* Based on data exploration and model findings, consider initially focusing efforts on employees who:
    + are single
    + are 18-30 years old
    + have worked for fewer than 3 previous companies
    + had 0-3 training sessions last year
    + have spent fewer than 4 years in their current role
    + are fewer than 3 years past their last promotion
    + work overtime
    + travel rarely for business
    + have life sciences, marketing, and/or medical education fields

### Potential strategy

(1) Look at employee placement first.
    + Should some employees move to a different department?
    + Would they be happier? More engaged?
    + Are they currently in the department/role that is an optimal fit?
(2) Provide adequate and appropriate training
    + Employees may not feel that they are getting enough of the right training
(3) Reduce and/or re-align travel according to job role & department
    + Some employees may need to travel more to fully accomplish various tasks
    + Other employees may feel that they travel too much
(4) Address overtime
    + Can overtime be reduced?
    + Are temporary or seasonal hires needed?
    + Can overly aggressive deadlines be extended? What race are we trying to win?
    + Eliminate redundant or unnecessary job process requirements

* Aggregated satisfaction ratings (counts) may indicate some success with implemented changes.
* Review status quarterly or semi-annually.

Other considerations {data-navmenu="Insights"}
=======================================================================

Column
-----------------------------------------------------------------------

### Other helpful data

* Include attrition categories - i.e., instead of grouping by $attrition = yes\, or \, no$, consider expanding the number of categories/reasons for attrition, such as: quit, termination, resignation, retirement, death, medical, relocation, etc. 
* Clarify the meanings of $dailyrate$, $hourlyrate$, $monthlyrate$ and/or how they relate to $monthlyincome$
* Should we assume that $dailyrate$, $hourlyrate$, $monthlyrate$ represent an employee's salary? If so, shouldn't they be consistent? (i.e., assuming an 8-hour workday, 8 * $hourlyrate$ should equal $dailyrate$, etc.)
* Does $monthlyincome$ represent employee salary before or after deducting taxes & contributions (i.e., income tax, Social Security, medical/vision/dental insurance, etc.)?
* The amount of overtime (i.e., the number of overtime hours worked, which day of the week overtime was worked, whether overtime was worked on a holiday and which one, etc.) may provide more insight than 'yes' or 'no' responses.
* The type and amount of training received last year may be more informative and provide better insight (i.e., online, seminar, webinar, brown-bag, formal class, class at an outside formal institution [also online, blended, or traditional], etc.)
* Exclusive of what the model indicates, compare the data to relevant HR/employment requirements - is the data representative of meeting or not meeting certain state or federal employment guidelines/requirements? Diversity comes to mind. If certain requirements are not being met, then fulfilling those requirements could change the model and the insights it leads to.

Column
-----------------------------------------------------------------------

### Other things to try and/or explore

* Look at a comparison of the misclassified instances from the test set vs. the instances with high leverage in the training set. Are there similarities or differences? Anything that might indicate what's causing the misclassification?
* GAM - to try a smooth, non-linear model
* Incorporate/use SQL (via the $sqldf$ package) to compare outlier instances vs. high-leverage instances found in the training set. How are they similar? How are they different? 
* Incorporate/use SQL to compare misclassified instances from the test set vs. the outlier and/or high-leverage instances in the training set. How are they similar? How are they different? This could be informative about why those test-set instances were misclassified.
* Discretize, or group, select predictor variables, such as $age$, $distancefromhome$, $dailyrate$, $hourlyrate$, and/or $monthlyincome$.
* Bootstrap or randomly sample instances to add observations that balance the response variable
    + is this an acceptable practice?
    + how would the model change or be different (at least for logistic regression)?
* Other possible models
    + AdaBoost
    + Neural net(s) - simple, CNN, RNN, etc.
    + Survival analysis

```{r}
# clean memory
rm(list = ls())
```

Final thoughts {data-navmenu="Insights"}
=======================================================================

* In my opinion, using a flexdashboard structure in Rmarkdown is a good way to establish a sound workflow for conducting data analysis.
* Flexdashboard templates can be built and saved that are unique to the different types of analyses a business/data analyst might perform. That is, one can incorporate sections for modeling diagnostics or other elements unique to a specific modeling method.
* For large datasets (roughly > 3-5 GB) this might not be an optimal option for one's workflow, but for smaller data sets I found this to be a good way of working with data and the corresponding analysis.
* Yes, one can think of this as being similar to using Jupyter notebooks. However, I have much more control over the design of this product. I chose not to use Jupyter notebooks because they have a very linear, sequential feel. Instead, I like the ability to switch between tabs/menus to review previous work. Doing so felt more like looking back and forth between different sections of a (real) book vs. 
scrolling up and down on a webpage. Your preference(s) may differ.
* Using section and code-chunk headers in Rmarkdown is very helpful for navigating raw code. This should be a routine practice anyway.
* Depending on personal/organizational needs, the code generating one's dashboard workflow could assist in building relevant Shiny apps.
* R/RStudio is very versatile. If needed, there are packages that allow one to run SQL, Python, and other analytics-related languages.
* Using a workflow tool like this provides an easy way to discuss one's analysis (ongoing or finished) with colleagues. (NOTE: the source code can be embedded in the dashboard.)
* Depending on how the dashboard is set up, one can present directly from it, or it can serve as a supporting product during the Q&A portion of a presentation.